Logs: liberachat/#haskell
| 2021-06-16 12:59:42 | × | hello20 quits (~hello@cpc97208-walt22-2-0-cust196.13-2.cable.virginm.net) (Ping timeout: 268 seconds) |
| 2021-06-16 13:00:01 | → | chomwitt joins (~Pitsikoko@athedsl-20549.home.otenet.gr) |
| 2021-06-16 13:01:15 | → | alx741 joins (~alx741@186.178.108.66) |
| 2021-06-16 13:01:29 | × | y04nn quits (~y04nn@81.17.24.204) (Ping timeout: 252 seconds) |
| 2021-06-16 13:02:32 | → | eggplantade joins (~Eggplanta@2600:1700:bef1:5e10:cded:c7cb:4d63:a64a) |
| 2021-06-16 13:03:11 | × | andreas303 quits (~andreas@gateway/tor-sasl/andreas303) (Quit: andreas303) |
| 2021-06-16 13:07:13 | × | eggplantade quits (~Eggplanta@2600:1700:bef1:5e10:cded:c7cb:4d63:a64a) (Ping timeout: 268 seconds) |
| 2021-06-16 13:08:07 | → | raek joins (~raek@2001:9b1:efe:3200:d250:99ff:fec0:e153) |
| 2021-06-16 13:08:20 | × | shredder quits (~shredder@user/shredder) (Ping timeout: 268 seconds) |
| 2021-06-16 13:09:01 | → | bmo joins (~bmo@185.209.196.142) |
| 2021-06-16 13:10:09 | → | henninb joins (~user@63.226.174.157) |
| 2021-06-16 13:10:18 | × | krjst quits (~krjst@2604:a880:800:c1::16b:8001) (Quit: bye) |
| 2021-06-16 13:11:44 | → | crazazy joins (~user@130.89.171.203) |
| 2021-06-16 13:12:17 | ← | henninb parts (~user@63.226.174.157) () |
| 2021-06-16 13:12:54 | × | obs\ quits (~obscur1ty@102.41.69.204) (Quit: Leaving) |
| 2021-06-16 13:13:10 | → | obs\ joins (~obscur1ty@102.41.69.204) |
| 2021-06-16 13:14:24 | × | ukari quits (~ukari@user/ukari) (Remote host closed the connection) |
| 2021-06-16 13:15:05 | → | ukari joins (~ukari@user/ukari) |
| 2021-06-16 13:15:18 | → | ddellacosta joins (~ddellacos@86.106.121.100) |
| 2021-06-16 13:17:12 | × | obs\ quits (~obscur1ty@102.41.69.204) (Changing host) |
| 2021-06-16 13:17:12 | → | obs\ joins (~obscur1ty@user/obs/x-5924898) |
| 2021-06-16 13:17:22 | × | cheater quits (~Username@user/cheater) (Remote host closed the connection) |
| 2021-06-16 13:18:43 | → | zebrag joins (~chris@user/zebrag) |
| 2021-06-16 13:18:55 | → | krjst joins (~krjst@2604:a880:800:c1::16b:8001) |
| 2021-06-16 13:20:03 | × | ddellacosta quits (~ddellacos@86.106.121.100) (Ping timeout: 268 seconds) |
| 2021-06-16 13:20:15 | × | kayprish quits (~kayprish@46.240.143.86) (Remote host closed the connection) |
| 2021-06-16 13:20:58 | × | krjst quits (~krjst@2604:a880:800:c1::16b:8001) (Client Quit) |
| 2021-06-16 13:21:43 | → | cheater joins (~Username@user/cheater) |
| 2021-06-16 13:22:14 | → | shapr joins (~user@pool-100-36-247-68.washdc.fios.verizon.net) |
| 2021-06-16 13:22:44 | → | eggplantade joins (~Eggplanta@2600:1700:bef1:5e10:cded:c7cb:4d63:a64a) |
| 2021-06-16 13:23:26 | × | Guest9 quits (~Guest9@103.250.139.6) (Quit: Connection closed) |
| 2021-06-16 13:24:26 | × | jumper149 quits (~jumper149@80.240.31.34) (Ping timeout: 244 seconds) |
| 2021-06-16 13:24:32 | × | psydroid quits (~psydroidm@2001:470:69fc:105::165) (Changing host) |
| 2021-06-16 13:24:32 | → | psydroid joins (~psydroidm@user/psydroid) |
| 2021-06-16 13:25:22 | → | krjst joins (~krjst@2604:a880:800:c1::16b:8001) |
| 2021-06-16 13:25:50 | → | sbmsr joins (~pi@104-6-130-18.lightspeed.miamfl.sbcglobal.net) |
| 2021-06-16 13:27:01 | × | eggplantade quits (~Eggplanta@2600:1700:bef1:5e10:cded:c7cb:4d63:a64a) (Ping timeout: 244 seconds) |
| 2021-06-16 13:28:22 | → | AgentM joins (~agentm@pool-162-83-130-212.nycmny.fios.verizon.net) |
| 2021-06-16 13:28:34 | × | cheater quits (~Username@user/cheater) (Ping timeout: 244 seconds) |
| 2021-06-16 13:29:05 | → | cheater joins (~Username@user/cheater) |
| 2021-06-16 13:32:17 | × | haltux quits (~haltux@a89-154-181-47.cpe.netcabo.pt) (Ping timeout: 252 seconds) |
| 2021-06-16 13:32:52 | <bmo> | What is a good approach to parsing (very large) XML files? I started off using xml-conduit as it seems a good fit, but maybe I am wrong |
| 2021-06-16 13:33:00 | × | azeem quits (~azeem@176.201.22.245) (Ping timeout: 268 seconds) |
| 2021-06-16 13:33:06 | <bmo> | I currently have a minor problem with that: paste.tomsmeding.com/WyuOXLLK |
| 2021-06-16 13:33:14 | → | azeem joins (~azeem@176.201.43.174) |
| 2021-06-16 13:33:34 | × | raehik quits (~raehik@cpc95906-rdng25-2-0-cust156.15-3.cable.virginm.net) (Quit: WeeChat 3.1) |
| 2021-06-16 13:34:39 | → | raehik joins (~raehik@cpc95906-rdng25-2-0-cust156.15-3.cable.virginm.net) |
| 2021-06-16 13:34:54 | <bmo> | So basically I cannot assume an order on the xml-tags. With that example an entry consists of `persons` (multiple fields that contain a `Text`) and `title` which is `Text` too. The naive way of just parsing `persons` first and then the `title` breaks as soon as the XML is not following the same order (duh) |
| 2021-06-16 13:35:32 | <bmo> | Is there an elegant way of parsing such XML without breaking down my `Entry`'s fields, parsing them first and then re-order+validate? |
| 2021-06-16 13:38:00 | <bmo> | In that small example my current approach works for `bs0`, but for `bs1` it breaks as `title` precedes the `person`s (they might actually be interleaved in reality, so `<person>...<title>...<person>...` etc.) |
| 2021-06-16 13:39:15 | → | nschoe joins (~quassel@2a01:e0a:8e:a190:4dc0:5be8:9ad8:a5a4) |
| 2021-06-16 13:39:56 | × | jakzale quits (uid499518@id-499518.charlton.irccloud.com) (Quit: Connection closed for inactivity) |
| 2021-06-16 13:40:52 | → | waleee joins (~waleee@2001:9b0:216:8200:d457:9189:7843:1dbd) |
| 2021-06-16 13:41:23 | → | benin036 joins (~benin@183.82.207.180) |
| 2021-06-16 13:41:45 | × | dunkeln quits (~dunkeln@94.129.65.28) (Ping timeout: 268 seconds) |
| 2021-06-16 13:41:52 | <shapr> | Anyone want to suggest improvements to https://github.com/shapr/takedouble/blob/main/src/Takedouble.hs#L71 and the saneFile function below? |
| 2021-06-16 13:42:06 | <shapr> | I feel like there's a better and/or simpler approach to that. |
| 2021-06-16 13:42:36 | → | dunkeln joins (~dunkeln@94.129.65.28) |
| 2021-06-16 13:43:56 | <dminuoso> | bmo: AttrParser is an Alternative, so you can use this https://hackage.haskell.org/package/parser-combinators-1.3.0/docs/Control-Monad-Permutations.html |
| 2021-06-16 13:44:13 | × | aplainzetakind quits (~johndoe@captainludd.powered.by.lunarbnc.net) (Ping timeout: 268 seconds) |
| 2021-06-16 13:45:45 | <dminuoso> | I'm a bit surprised, does AttrParser not do this for you already? |
| 2021-06-16 13:46:49 | → | Tuplanolla joins (~Tuplanoll@91-159-68-239.elisa-laajakaista.fi) |
| 2021-06-16 13:47:07 | <dminuoso> | Judging from the implementation, the order shouldn't matter. |
| 2021-06-16 13:47:28 | <bmo> | dminuoso, actually the attributes are valid up to permutation, true. I hadn't noticed. |
| 2021-06-16 13:47:36 | → | aplainzetakind joins (~johndoe@captainludd.powered.by.lunarbnc.net) |
| 2021-06-16 13:48:05 | <dminuoso> | So when you said "breaks", is that what you think happens? |
| 2021-06-16 13:48:08 | <dminuoso> | Have you actually tried it? |
| 2021-06-16 13:48:29 | <bmo> | But my problem is with actual tags. So I have `<e> <p>x</p> <t>y</t> </e>` but sometimes `<e> <t>y</t> <p>x</p> </e>` |
| 2021-06-16 13:49:08 | <bmo> | dminuoso, I just tested permuting the attributes and that works. But the permuted tags don't, which, in hindsight, is expected |
| 2021-06-16 13:49:12 | → | eggplantade joins (~Eggplanta@2600:1700:bef1:5e10:cded:c7cb:4d63:a64a) |
| 2021-06-16 13:49:13 | <dminuoso> | Ahh |
| 2021-06-16 13:49:15 | <shapr> | Hm, I think I'll convert the "get all files in all subdirectories" function into something that could run in a bunch of threads, just to see if that's faster. |
| 2021-06-16 13:49:29 | <dminuoso> | bmo: Yeah, I don't think permutation on tags can reasonably work in xml-conduit |
| 2021-06-16 13:49:34 | <shapr> | I've read NVMe drives work best with a deep queue of requests |
| 2021-06-16 13:49:42 | <maerwald> | shapr: threading over filesystem operations? :> |
| 2021-06-16 13:49:52 | <dminuoso> | bmo: For starters, what does "permutation" even mean? A naive take on XML is that it's a tree. |
| 2021-06-16 13:50:38 | → | muto joins (~muto@d75-159-225-7.abhsia.telus.net) |
| 2021-06-16 13:50:46 | <shapr> | maerwald: yeah, I think it could speed up reading a bunch of files to check for duplicates |
| 2021-06-16 13:51:10 | <shapr> | maerwald: I'm also slowly working my way towards this kind of thing: https://www.tbray.org/ongoing/When/202x/2021/03/27/Topfew-and-Amdahl |
| 2021-06-16 13:51:16 | × | oo_miguel quits (~pi@89-72-187-203.dynamic.chello.pl) (Quit: WeeChat 2.3) |
| 2021-06-16 13:51:19 | × | sbmsr quits (~pi@104-6-130-18.lightspeed.miamfl.sbcglobal.net) (Ping timeout: 272 seconds) |
| 2021-06-16 13:51:20 | <bmo> | Well, within an `<e>` (I'm just using abbreviations of that example I gave) "fields" are sometimes permuted, i.e. not in a particular order |
| 2021-06-16 13:51:28 | <shapr> | that is, a count min sketch on top of Apache logs |
| 2021-06-16 13:51:40 | <bmo> | Luckily the leaves in such an `<e>` are always small, so they wouldn't nest further. |
| 2021-06-16 13:51:59 | <shapr> | maerwald: at least for that post, reading multiple pieces of a large file in different threads was faster |
| 2021-06-16 13:52:23 | → | ddellacosta joins (~ddellacos@86.106.121.100) |
| 2021-06-16 13:52:48 | <dminuoso> | bmo: You can <|> NameMatchers together |
| 2021-06-16 13:53:42 | × | eggplantade quits (~Eggplanta@2600:1700:bef1:5e10:cded:c7cb:4d63:a64a) (Ping timeout: 264 seconds) |
| 2021-06-16 13:54:21 | <dminuoso> | bmo: It seems you'd have to do something along these lines: |
| 2021-06-16 13:54:56 | <bmo> | dminuoso, so I'd have to (with xml-conduit that is) parse the tags into something isomorphic to `data ELeave = P Text | T Text` and then re-order+validate once I parsed all of `<e>`'s leaves? |
| 2021-06-16 13:56:28 | × | ddellacosta quits (~ddellacos@86.106.121.100) (Ping timeout: 244 seconds) |
| 2021-06-16 13:58:33 | <dminuoso> | bmo: Something along the lines of: data Ki = Ent | Per | Tit; isEntry :: A -> Maybe Ki; isPerson :: A -> Maybe Ki; isTitle :: A -> Maybe Ki; tag (isEntry <|> isPerson <|> isTitle) (\case Ent -> ...; Per -> ...; Tit -> ...) |
| 2021-06-16 13:58:50 | <dminuoso> | This will become very awkward to write I think |
| 2021-06-16 13:59:10 | <dminuoso> | Since you then have to keep track of what kind of element you have consumed |
| 2021-06-16 13:59:12 | <bmo> | Yeah :( I kinda wanted to avoid this somehow |
| 2021-06-16 14:00:06 | <bmo> | Especially since that is a small example and the real thing is quite a bit bigger |
| 2021-06-16 14:01:21 | <dminuoso> | bmo: have you considered tagsoup perhaps? |
| 2021-06-16 14:02:12 | <bmo> | No, so far I only considered xml-conduit and had a quick look at how I can use DtdToHaskell with HaXml but conduit seemed simpler. |
| 2021-06-16 14:02:40 | <bmo> | I was not aware of tagsoup, I'll have a look at it. Thanks a lot for your assistance! |
| 2021-06-16 14:02:52 | <dminuoso> | with tagsoup you can convert it straight into a plain tree, that might be much easier to work with for you |
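
The "collect leaves, then re-order+validate" idea that bmo describes could be sketched roughly as below. This is only an illustration under assumptions, not the code from the paste: `ELeaf`, `Entry`, `toEither` and `buildEntry` are hypothetical names, `String` stands in for `Text` so the sketch needs nothing beyond base, and the rule "exactly one title, any number of persons" is an assumed validation rule. The XML parsing step itself is left out; the list of leaves is assumed to come from whatever parser reads the children of one `<e>`.

```haskell
module Main where

import Data.Either (partitionEithers)

-- Hypothetical leaf type: one constructor per child tag of <e>.
-- String is used instead of Text to keep the sketch base-only.
data ELeaf = P String | T String
  deriving (Show, Eq)

-- Hypothetical target record, standing in for the Entry from the paste.
data Entry = Entry
  { persons :: [String]
  , title   :: String
  } deriving (Show, Eq)

-- Sort a leaf into the "person" side or the "title" side.
toEither :: ELeaf -> Either String String
toEither (P p) = Left p
toEither (T t) = Right t

-- Validate once all leaves of a single <e> have been collected.
-- Assumed rule: exactly one title, any number of persons.
-- Document order is preserved within each kind of leaf.
buildEntry :: [ELeaf] -> Either String Entry
buildEntry leaves =
  case partitionEithers (map toEither leaves) of
    (ps, [t]) -> Right (Entry ps t)
    (_, [])   -> Left "missing <t>"
    (_, _)    -> Left "more than one <t>"

main :: IO ()
main = do
  print (buildEntry [P "x", T "y"])         -- persons first
  print (buildEntry [T "y", P "x"])         -- title first
  print (buildEntry [P "a", T "y", P "b"])  -- interleaved
```

Because the leaves are partitioned by kind before the record is built, the order in which `<p>` and `<t>` appear inside an `<e>` no longer matters; the cost is one extra validation function per record type, which is the awkwardness discussed above.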
All times are in UTC.