Logs: freenode/#haskell
| 2020-11-17 15:21:49 | <nut> | ah, then it could be character, let me check |
| 2020-11-17 15:21:50 | <merijn> | You can read it as ByteString, do byte indexing on that, then selectively decode text starting from an offset |
| 2020-11-17 15:22:02 | <merijn> | nut: If it's character you're doomed too :) |
| 2020-11-17 15:22:05 | → | cosimone joins (~cosimone@2001:b07:ae5:db26:d849:743b:370b:b3cd) |
| 2020-11-17 15:22:07 | <dolio> | Once it's in Text all the offsets could be wrong anyway. |
| 2020-11-17 15:23:12 | <nut> | so for a Data.Text string, there's no way to move some kind of pointer within the string right? |
| 2020-11-17 15:23:35 | <merijn> | nut: You can index Text "by codepoint", maybe |
| 2020-11-17 15:24:08 | → | Entertainment joins (~entertain@104.246.132.210) |
| 2020-11-17 15:24:22 | <nut> | so basically fseek equivalent |
| 2020-11-17 15:24:38 | <merijn> | nut: The real, honest answer is that: in every single programming language indexing strings is a broken clusterfuck you cannot rely on to do anything sensible (even though it may appear to do something sensible if you only ever look at ascii) |
| 2020-11-17 15:25:19 | → | conal joins (~conal@64.71.133.70) |
| 2020-11-17 15:25:22 | → | nados joins (~dan@69-165-210-185.cable.teksavvy.com) |
| 2020-11-17 15:25:41 | <nut> | The offset idea does seem efficient. Without it, how does Haskell manage quick lookup? |
| 2020-11-17 15:26:02 | × | SanchayanMaity quits (~Sanchayan@106.201.35.233) (Quit: leaving) |
| 2020-11-17 15:26:20 | <merijn> | nut: Like I said, if the offset is in bytes you can easily read a bytestring and index that and then decode to Text "on demand" |
| 2020-11-17 15:26:21 | → | SanchayanMaity joins (~Sanchayan@106.201.35.233) |
| 2020-11-17 15:26:34 | <nut> | ok I see |
| 2020-11-17 15:26:46 | <nut> | So I'll use the bytestring package instead of text |
| 2020-11-17 15:26:54 | → | Deide joins (~Deide@217.155.19.23) |
| 2020-11-17 15:26:59 | → | britva joins (~britva@31-10-157-156.cgn.dynamic.upc.ch) |
| 2020-11-17 15:27:02 | × | conal quits (~conal@64.71.133.70) (Read error: Connection reset by peer) |
| 2020-11-17 15:27:15 | → | is_null joins (~jpic@pdpc/supporter/professional/is-null) |
| 2020-11-17 15:27:27 | <nut> | You gave me the hint to use text instead of bytestring a few hours ago before I went to the dentist |
| 2020-11-17 15:27:28 | <merijn> | nut: More practically for a dictionary I'd just read in the entire thing and create a Map |
| 2020-11-17 15:27:35 | × | SanchayanMaity quits (~Sanchayan@106.201.35.233) (Client Quit) |
| 2020-11-17 15:27:45 | → | conal joins (~conal@64.71.133.70) |
| 2020-11-17 15:27:50 | → | SanchayanMaity joins (~Sanchayan@106.201.35.233) |
| 2020-11-17 15:28:27 | <nut> | merijn: that would mean in memory lookup |
| 2020-11-17 15:28:39 | <nut> | merijn: How would you then serialize the thing? |
| 2020-11-17 15:28:55 | × | SanchayanMaity quits (~Sanchayan@106.201.35.233) (Client Quit) |
| 2020-11-17 15:29:03 | × | da39a3ee5e6b4b0d quits (~da39a3ee5@cm-171-98-79-192.revip7.asianet.co.th) (Ping timeout: 265 seconds) |
| 2020-11-17 15:29:12 | → | SanchayanMaity joins (~Sanchayan@106.201.35.233) |
| 2020-11-17 15:29:38 | <merijn> | nut: I'd just write the entire thing to disk at once and read it in at once |
| 2020-11-17 15:30:06 | <merijn> | Rather than dynamically indexing an open file. You *can* dynamically index the file, but that doesn't seem worth it unless it's truly massive |
| 2020-11-17 15:30:56 | <nut> | Most dictionary files I've seen have some sophisticated file format |
| 2020-11-17 15:31:04 | → | Guest_85 joins (5181d645@host81-129-214-69.range81-129.btcentralplus.com) |
| 2020-11-17 15:31:22 | <nut> | Such as the StarDict file format or dictd.org |
| 2020-11-17 15:31:57 | → | bitmapper joins (uid464869@gateway/web/irccloud.com/x-asjzblgwwtdcvjsz) |
| 2020-11-17 15:32:18 | <nut> | It's not massive, a few hundred M only. But I want to find out for the sake of learning |
| 2020-11-17 15:32:23 | <merijn> | nut: Ah, but *that* sounds more like a different question, that sounds like "how would I parse complicated/sophisticated file formats into something usable?" |
| 2020-11-17 15:33:25 | <nut> | Those file formats are designed to need fewer disk accesses and at the same time give quick search times |
| 2020-11-17 15:33:53 | <merijn> | @hoogle hSeek |
| 2020-11-17 15:33:53 | <lambdabot> | System.IO hSeek :: Handle -> SeekMode -> Integer -> IO () |
| 2020-11-17 15:33:53 | <lambdabot> | GHC.IO.Handle hSeek :: Handle -> SeekMode -> Integer -> IO () |
| 2020-11-17 15:33:53 | <lambdabot> | UnliftIO.IO hSeek :: MonadIO m => Handle -> SeekMode -> Integer -> m () |
| 2020-11-17 15:33:55 | <merijn> | @hoogle hGet |
| 2020-11-17 15:33:55 | <lambdabot> | Data.ByteString hGet :: Handle -> Int -> IO ByteString |
| 2020-11-17 15:33:55 | <lambdabot> | Data.ByteString.Char8 hGet :: Handle -> Int -> IO ByteString |
| 2020-11-17 15:33:55 | <lambdabot> | Data.ByteString.Lazy hGet :: Handle -> Int -> IO ByteString |
| 2020-11-17 15:34:18 | × | SanchayanMaity quits (~Sanchayan@106.201.35.233) (Quit: leaving) |
| 2020-11-17 15:34:22 | <nut> | Indeed, at first I thought there would be a Data.Text.hSeek |
| 2020-11-17 15:34:30 | → | darjeeling_ joins (~darjeelin@122.245.211.11) |
| 2020-11-17 15:34:36 | <merijn> | nut: If you open a file Handle you can use hSeek to jump to offsets to read bytes from there in the file, the same way you would in other languages |
| 2020-11-17 15:34:38 | → | SanchayanMaity joins (~Sanchayan@106.201.35.233) |
| 2020-11-17 15:34:39 | × | toorevitimirp quits (~tooreviti@117.182.180.118) (Remote host closed the connection) |
| 2020-11-17 15:34:57 | × | morbeus quits (vhamalai@gateway/shell/tkk.fi/x-sygopmpjleahuvxk) (Remote host closed the connection) |
| 2020-11-17 15:34:59 | <merijn> | nut: You might also be interested in: |
| 2020-11-17 15:35:01 | <merijn> | @hackage binary |
| 2020-11-17 15:35:01 | <lambdabot> | https://hackage.haskell.org/package/binary |
| 2020-11-17 15:35:19 | <merijn> | nut: Which is a library for decoding ByteString into custom data |
| 2020-11-17 15:35:39 | <merijn> | @hackage attoparsec |
| 2020-11-17 15:35:39 | <lambdabot> | https://hackage.haskell.org/package/attoparsec |
| 2020-11-17 15:36:28 | <dolio> | You can just use the hSeek from base. Text doesn't need to provide its own. |
| 2020-11-17 15:36:52 | <merijn> | dolio: Of course hSeek and then trying to read a String is *also* cursed :p |
| 2020-11-17 15:37:08 | <nut> | There is no hSeek from base |
| 2020-11-17 15:37:18 | <merijn> | System.IO.hSeek ? |
| 2020-11-17 15:37:24 | <nut> | at least not from Prelude |
| 2020-11-17 15:37:45 | <dolio> | Prelude doesn't export everything in base. |
| 2020-11-17 15:37:48 | <nut> | i see |
| 2020-11-17 15:39:17 | × | SanchayanMaity quits (~Sanchayan@106.201.35.233) (Client Quit) |
| 2020-11-17 15:39:28 | <merijn> | Prelude only exports a fraction of base :) |
| 2020-11-17 15:39:28 | × | kritzefitz quits (~kritzefit@fw-front.credativ.com) (Read error: Connection timed out) |
| 2020-11-17 15:43:22 | × | ericsagn1 quits (~ericsagne@2405:6580:0:5100:d6bc:df2c:ba38:451b) (Ping timeout: 260 seconds) |
| 2020-11-17 15:44:15 | × | Guest_85 quits (5181d645@host81-129-214-69.range81-129.btcentralplus.com) (Remote host closed the connection) |
| 2020-11-17 15:44:30 | hackage | hedn 0.3.0.2 - EDN parsing and encoding https://hackage.haskell.org/package/hedn-0.3.0.2 (AlexanderBondarenko) |
| 2020-11-17 15:44:49 | → | royal_screwup21 joins (52254809@gateway/web/cgi-irc/kiwiirc.com/ip.82.37.72.9) |
| 2020-11-17 15:45:17 | → | idhugo joins (~idhugo@80-62-116-101-mobile.dk.customer.tdc.net) |
| 2020-11-17 15:46:37 | × | MarcelineVQ quits (~anja@198.254.202.72) (Ping timeout: 260 seconds) |
| 2020-11-17 15:48:16 | × | Tario quits (~Tario@201.192.165.173) (Read error: Connection reset by peer) |
| 2020-11-17 15:50:09 | → | MarcelineVQ joins (~anja@198.254.202.72) |
| 2020-11-17 15:51:33 | × | Franciman quits (~francesco@host-82-56-223-169.retail.telecomitalia.it) (Quit: Leaving) |
| 2020-11-17 15:53:15 | <tomjaguarpaw> | merijn: Compact regions didn't help with my GC problem in the end because I realised my test cases are also generating large amounts of data! However, I did manage to combine your System.Mem.performGC and GHC.Stats suggestions with RTS options to good effect: https://stackoverflow.com/a/64878595/997606 |
| 2020-11-17 15:54:25 | → | knupfer joins (~Thunderbi@i59F7FFD9.versanet.de) |
| 2020-11-17 15:55:24 | → | ericsagn1 joins (~ericsagne@2405:6580:0:5100:9c16:5b76:e160:ad6d) |
| 2020-11-17 15:55:41 | → | jfredett joins (~jfredett@178.162.212.214) |
| 2020-11-17 15:56:34 | → | kritzefitz joins (~kritzefit@fw-front.credativ.com) |
| 2020-11-17 15:56:46 | × | christo quits (~chris@81.96.113.213) (Remote host closed the connection) |
| 2020-11-17 15:57:14 | → | oish joins (~charlie@228.25.169.217.in-addr.arpa) |
| 2020-11-17 15:58:51 | → | zebrag joins (~inkbottle@aaubervilliers-654-1-89-20.w86-212.abo.wanadoo.fr) |
| 2020-11-17 16:00:03 | → | carlomagno1 joins (~cararell@148.87.23.11) |
| 2020-11-17 16:00:03 | × | carlomagno quits (~cararell@148.87.23.10) (Remote host closed the connection) |
| 2020-11-17 16:00:26 | → | Rudd0 joins (~Rudd0@185.189.115.98) |
| 2020-11-17 16:02:23 | <merijn> | tomjaguarpaw: Well, to be fair, if your code is producing lots of data, then perhaps including it in your benchmarks isn't so wrong :p |
| 2020-11-17 16:04:41 | → | nuncanada joins (~dude@179.235.160.168) |
| 2020-11-17 16:05:01 | → | christo joins (~chris@81.96.113.213) |
| 2020-11-17 16:05:12 | → | Stanley00 joins (~stanley00@unaffiliated/stanley00) |
| 2020-11-17 16:05:38 | → | asthasr joins (~asthasr@162.210.29.120) |
| 2020-11-17 16:07:30 | × | sord937 quits (~sord937@gateway/tor-sasl/sord937) (Remote host closed the connection) |
| 2020-11-17 16:07:40 | → | christo_ joins (~chris@81.96.113.213) |
| 2020-11-17 16:07:42 | × | christo quits (~chris@81.96.113.213) (Read error: Connection reset by peer) |
All times are in UTC.
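
The approach merijn describes for nut (open a Handle, hSeek to a byte offset, hGet a fixed number of bytes, then decode only that slice to Text) could be sketched roughly as follows. This is an illustration, not code from the conversation: the helper name readEntry, the file name demo-dict.txt, and the two sample entries are hypothetical, and it assumes the file is UTF-8 and that offsets and lengths are counted in bytes, as discussed above.

```haskell
module Main (main) where

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Text.Encoding (decodeUtf8', encodeUtf8)
import Data.Text.Encoding.Error (UnicodeException)
import System.IO

-- | Hypothetical helper: read @len@ bytes starting at byte offset @off@
-- and decode just that slice as UTF-8. Decoding fails with 'Left' if the
-- slice boundaries cut through a multi-byte character.
readEntry :: FilePath -> Integer -> Int -> IO (Either UnicodeException T.Text)
readEntry path off len =
  withBinaryFile path ReadMode $ \h -> do
    hSeek h AbsoluteSeek off   -- jump to the byte offset, like fseek
    bytes <- BS.hGet h len     -- strict read of at most len bytes
    pure (decodeUtf8' bytes)   -- decode "on demand", only this slice

main :: IO ()
main = do
  hSetEncoding stdout utf8
  let path   = "demo-dict.txt" -- hypothetical demo file
      first  = encodeUtf8 (T.pack "naïve: simple\n")
      second = encodeUtf8 (T.pack "café: coffee shop\n")
  BS.writeFile path (first <> second)
  -- The second entry starts at the byte length of the first entry, which
  -- is larger than its character count because 'ï' takes two bytes.
  result <- readEntry path (fromIntegral (BS.length first)) (BS.length second)
  case result of
    Left err  -> hPutStrLn stderr ("invalid UTF-8 slice: " ++ show err)
    Right txt -> TIO.putStr txt
```

Run as is, this should print the second entry. Using decodeUtf8' rather than decodeUtf8 makes a bad offset show up as a Left value instead of an exception, which fits the warning in the log that character-based offsets are unreliable. For a file of a few hundred megabytes, merijn's other suggestion (read everything once and build a Map) may well be simpler.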