Logs: liberachat/#haskell
| 2026-02-14 20:17:44 | <tomsmeding> | I see |
| 2026-02-14 20:17:56 | <int-e> | (Rather than digging into the history I'll just assume this hasn't changed recently except for bumping versions.) |
| 2026-02-14 20:18:07 | → | s3np41 joins (~s3np41@078088254000.unknown.vectranet.pl) |
| 2026-02-14 20:18:17 | <tomsmeding> | do you happen to know how critical those version upper bounds are, in GHC's LLVM support? |
| 2026-02-14 20:18:38 | <int-e> | I don't |
| 2026-02-14 20:21:49 | × | merijn quits (~merijn@host-cl.cgnat-g.v4.dfn.nl) (Ping timeout: 250 seconds) |
| 2026-02-14 20:22:08 | <int-e> | Hmm. For me, the baked in commands (from the settings file) have a version, e.g. llc-14 for ghc-9.10.3 and llc-19 for ghc-9.12.2. I wonder what the binary distributions put there... ghc --info | grep LLVM will show that info without you having to go looking for the settings file. |
| 2026-02-14 20:23:07 | <geekosaur> | they can be pretty critical but I don't think it's critical for recent versions |
| 2026-02-14 20:23:31 | <geekosaur> | there was a point where `opt` parameters changed and ghc didn't know how to call newer versions correctly |
| 2026-02-14 20:25:54 | <geekosaur> | also I mentioned the settings file because I saw a claim in backscroll that the correct llvm version wasn't on their PATH, which means a settings file edit to point to the correct one |
| 2026-02-14 20:26:38 | → | ouilemur joins (~jgmerritt@user/ouilemur) |
| 2026-02-14 20:28:52 | <probie> | It's not giving me SIMD instruction :'( |
| 2026-02-14 20:29:02 | <probie> | I wonder if it's the use of `read` instead of `unsafeRead`? |
| 2026-02-14 20:29:31 | → | pavonia joins (~user@user/siracusa) |
| 2026-02-14 20:32:46 | → | merijn joins (~merijn@host-cl.cgnat-g.v4.dfn.nl) |
| 2026-02-14 20:33:37 | × | peterbecich quits (~Thunderbi@71.84.33.135) (Ping timeout: 264 seconds) |
| 2026-02-14 20:34:44 | <tomsmeding> | possibly, yes |
| 2026-02-14 20:37:41 | × | merijn quits (~merijn@host-cl.cgnat-g.v4.dfn.nl) (Ping timeout: 244 seconds) |
| 2026-02-14 20:38:17 | × | caubert quits (~caubert@user/caubert) (Ping timeout: 250 seconds) |
| 2026-02-14 20:38:35 | × | Pixi quits (~Pixi@user/pixi) (Quit: Leaving) |
| 2026-02-14 20:43:03 | <[exa]> | probie: btw why not make a small FFI to a relatively portable C? |
| 2026-02-14 20:43:41 | <[exa]> | (man, can we FFI to futhark?) |
| 2026-02-14 20:43:46 | → | Pixi joins (~Pixi@user/pixi) |
| 2026-02-14 20:44:08 | <tomsmeding> | [exa]: https://gitlab.com/Gusten_Isfeldt/futhask |
| 2026-02-14 20:44:27 | <tomsmeding> | (never used it) |
| 2026-02-14 20:45:36 | → | peterbecich joins (~Thunderbi@71.84.33.135) |
| 2026-02-14 20:45:43 | <[exa]> | ok not bad :) |
| 2026-02-14 20:49:55 | → | merijn joins (~merijn@host-cl.cgnat-g.v4.dfn.nl) |
| 2026-02-14 20:52:13 | <probie> | [exa]: Because I shouldn't need to |
| 2026-02-14 20:53:21 | → | caubert joins (~caubert@user/caubert) |
| 2026-02-14 20:54:32 | × | machinedgod quits (~machinedg@d75-159-126-101.abhsia.telus.net) (Ping timeout: 252 seconds) |
| 2026-02-14 20:56:12 | <[exa]> | probie: I find it better than relying on the compiler accidentally noticing that I want SIMD (but yeah it's still :( ) |
| 2026-02-14 20:56:37 | <tomsmeding> | GHC has SIMD primops, but they only work with LLVM |
| 2026-02-14 20:56:50 | <tomsmeding> | very recently IIRC some of them started working on NCG too |
| 2026-02-14 20:56:55 | <[exa]> | probie: btw try to unroll the loop manually, that might give llvm enough decisive force |
| 2026-02-14 20:57:02 | × | merijn quits (~merijn@host-cl.cgnat-g.v4.dfn.nl) (Ping timeout: 256 seconds) |
| 2026-02-14 20:57:03 | <probie> | I don't think it's LLVM's problem here; GHC is just not generating good code https://paste.tomsmeding.com/8ZYY5Pka |
| 2026-02-14 20:58:11 | <tomsmeding> | why are there so many loads for only two stores? I assume this is different code than you posted originally? |
| 2026-02-14 20:58:23 | × | caubert quits (~caubert@user/caubert) (Ping timeout: 252 seconds) |
| 2026-02-14 20:59:21 | <[exa]> | that looks like a lot of indirection |
| 2026-02-14 20:59:26 | <probie> | https://paste.tomsmeding.com/NRYKh5Fj |
| 2026-02-14 21:00:09 | × | L29Ah quits (~L29Ah@wikipedia/L29Ah) (Ping timeout: 260 seconds) |
| 2026-02-14 21:00:22 | <[exa]> | probie: you have unboxed or primitive vectors? |
| 2026-02-14 21:01:42 | <tomsmeding> | probie: if it's easy to paste the optimised LLVM IR, that would make it easier to see what's going on, probably |
| 2026-02-14 21:02:03 | <int-e> | IOVector is boxed. |
| 2026-02-14 21:02:09 | <tomsmeding> | there are a bunch of loop-invariant loads here that I expect llvm to lift out |
| 2026-02-14 21:02:22 | <EvanR> | last I heard ghc didn't have SIMD support |
| 2026-02-14 21:02:35 | <EvanR> | oh, LLVM |
| 2026-02-14 21:02:52 | <tomsmeding> | int-e: every mutable vector variant has its own definition of the "IOVector" type synonym |
| 2026-02-14 21:02:59 | <[exa]> | int-e: afaik you can import the one from the .unboxed.mutable or .primitive.mutable module |
| 2026-02-14 21:03:10 | <int-e> | tomsmeding: gah |
| 2026-02-14 21:03:28 | <probie> | int-e: it's not; I omitted the `import qualified Data.Vector.Unboxed.Mutable as V`. Weirdly, I get slightly better llvm if I use `Storable` instead of `Unboxed` |
| 2026-02-14 21:03:42 | <int-e> | Right. I should've known that. |
| 2026-02-14 21:03:48 | <tomsmeding> | in general, Storable is more straightforward |
| 2026-02-14 21:03:56 | <tomsmeding> | but in theory, either should work here |
| 2026-02-14 21:04:08 | <fgarcia> | llvm goes to at least 23 now. it could be the SIMD changes haven't made it down |
| 2026-02-14 21:04:25 | × | infinity0 quits (~infinity0@pwned.gg) (Ping timeout: 255 seconds) |
| 2026-02-14 21:06:12 | <[exa]> | probie: man, you're introducing a data dependency there, it can't simd |
| 2026-02-14 21:06:45 | <tomsmeding> | [exa]: isn't this code just zipWith (+) |
| 2026-02-14 21:06:50 | <tomsmeding> | oh no |
| 2026-02-14 21:06:57 | <[exa]> | it's writing back to the original vector |
| 2026-02-14 21:07:02 | <tomsmeding> | yeah probie ^ |
| 2026-02-14 21:07:15 | <tomsmeding> | lol |
| 2026-02-14 21:07:34 | → | caubert joins (~caubert@user/caubert) |
| 2026-02-14 21:07:50 | <[exa]> | probie: try this https://paste.tomsmeding.com/GjpwizwI |
| 2026-02-14 21:08:28 | <[exa]> | (edited right into pastebin so didn't try it but you see the point I guess) |
| 2026-02-14 21:08:49 | <tomsmeding> | with this being Word8 you may even want to unroll 32x |
| 2026-02-14 21:09:11 | <tomsmeding> | or at least 16x to use 128bit SSE4 registers |
| 2026-02-14 21:09:20 | <[exa]> | oh |
| 2026-02-14 21:09:27 | <[exa]> | ok I somehow hoped this is at least floats |
| 2026-02-14 21:09:30 | <tomsmeding> | but 4 should at least get you different assembly |
| 2026-02-14 21:09:35 | → | merijn joins (~merijn@host-cl.cgnat-g.v4.dfn.nl) |
| 2026-02-14 21:09:40 | <[exa]> | are there SIMD instructions for chars? |
| 2026-02-14 21:09:48 | <tomsmeding> | yes |
| 2026-02-14 21:10:02 | [exa] | learned today |
| 2026-02-14 21:10:17 | <tomsmeding> | https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=epi8 |
| 2026-02-14 21:10:29 | <tomsmeding> | _mm_add_epi8 is the one you want here (paddb) |
| 2026-02-14 21:11:05 | <tomsmeding> | or the _mm256 version, or _mm512 if you want to use your juicy AVX512 |
| 2026-02-14 21:11:39 | <tomsmeding> | think about that, 64 adds with 1-cycle latency |
| 2026-02-14 21:11:56 | <[exa]> | oh these are the epi8 instructions from the intrinsic guide that I ignored every time |
| 2026-02-14 21:12:10 | <tomsmeding> | epi is integer stuff |
| 2026-02-14 21:12:20 | × | peterbecich quits (~Thunderbi@71.84.33.135) (Ping timeout: 256 seconds) |
| 2026-02-14 21:12:51 | <tomsmeding> | and apparently it can even do two of those _mm256_add_epi8 instructions in one cycle, by the CPI of 0.5 |
| 2026-02-14 21:13:28 | <tomsmeding> | (yes, the throughput label is misleading; I checked that a div_pd has 4 there and add_pd 0.5, so indeed it's CPI = 1/throughput) |
| 2026-02-14 21:13:46 | <probie> | There isn't really a data dependency though, since memory is never read again after being written |
| 2026-02-14 21:13:52 | <[exa]> | tomsmeding: where do you read that? intel intrinsics guide says 3 per cycle |
| 2026-02-14 21:13:56 | <probie> | oh wait, <expletive> |
| 2026-02-14 21:14:04 | <tomsmeding> | probie: the compiler doesn't know that |
| 2026-02-14 21:14:06 | <probie> | there can be aliasing |
| 2026-02-14 21:14:09 | <tomsmeding> | yes |
| 2026-02-14 21:14:20 | <[exa]> | probie: memory order too strong QQ |
| 2026-02-14 21:14:29 | × | merijn quits (~merijn@host-cl.cgnat-g.v4.dfn.nl) (Ping timeout: 245 seconds) |
| 2026-02-14 21:14:33 | <tomsmeding> | [exa]: oh I mistyped, I meant _mm512_add_epi8 |
| 2026-02-14 21:14:38 | <[exa]> | probie: anyway it might be the case that the compiler just ignores it but I'd bet this is the problem number 1 |
| 2026-02-14 21:14:40 | <tomsmeding> | the _mm256 and _mm variants indeed have 3 |
| 2026-02-14 21:14:52 | <tomsmeding> | this is DEFINITELY not ignored by llvm |
| 2026-02-14 21:15:11 | <probie> | Even gcc only ignores it if you pass -O3 IIRC |
| 2026-02-14 21:15:13 | <tomsmeding> | and I can also assure you that GHC will not tell LLVM that these things do not alias |
| 2026-02-14 21:15:23 | <tomsmeding> | ignoring this is a blatant violation of the semantics |
| 2026-02-14 21:15:42 | <tomsmeding> | I would be surprised if ghc does this at any optimisation level |
All times are in UTC.
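
Editor's note: the aliasing point raised at the end of the log (the output may overlap the inputs, so the compiler cannot freely vectorise) can be illustrated with a small C sketch. It is not probie's code, which is only linked from the log; the function names `add_bytes_scalar`, `add_bytes_chunk4`, and `overlap_changes_result` are hypothetical and chosen for this illustration. The second function only models what a 4-wide vectorised loop would do (load a chunk, add, store a chunk); it does not use real SIMD intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* In-order scalar loop: every store happens before the next load.
   This is the behaviour a compiler has to preserve when it cannot
   prove that `out` does not overlap `a` or `b`. */
void add_bytes_scalar(uint8_t *out, const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = (uint8_t)(a[i] + b[i]);
    }
}

/* Model of a 4-wide vectorised loop (hypothetical helper): load four
   elements from each input, add them, then store four results.
   Equivalent to the scalar loop only if `out` overlaps neither input. */
void add_bytes_chunk4(uint8_t *out, const uint8_t *a, const uint8_t *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint8_t x[4], y[4];
        memcpy(x, a + i, 4);            /* "vector load" of a[i..i+3] */
        memcpy(y, b + i, 4);            /* "vector load" of b[i..i+3] */
        for (int k = 0; k < 4; k++) {
            x[k] = (uint8_t)(x[k] + y[k]);
        }
        memcpy(out + i, x, 4);          /* "vector store" to out[i..i+3] */
    }
    for (; i < n; i++) {                /* scalar tail */
        out[i] = (uint8_t)(a[i] + b[i]);
    }
}

/* Runs both versions with the output shifted one element past the
   input, so each store feeds the next load. Returns 1 if they differ. */
int overlap_changes_result(void) {
    uint8_t s[5] = {1, 0, 0, 0, 0};
    uint8_t c[5] = {1, 0, 0, 0, 0};
    add_bytes_scalar(s + 1, s, s, 4);   /* s becomes {1, 2, 4, 8, 16} */
    add_bytes_chunk4(c + 1, c, c, 4);   /* c becomes {1, 2, 0, 0, 0}  */
    assert(s[4] == 16);
    assert(c[4] == 0);
    return memcmp(s, c, sizeof s) != 0;
}
```

With overlapping buffers the two loops produce different results, which is why vectorising without proof of non-aliasing would change the program's meaning. In C the programmer can promise non-overlap with `restrict`-qualified pointers; according to tomsmeding in the log, GHC does not pass any such non-aliasing information to LLVM.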