01:03sghuge: airlied: okay, I tried 4 times, and still I am hitting issue with vmware tests. https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/42282
01:04sghuge: changes are all anv related, it should not break anything on vmware though.
01:06airlied: sghuge: submit an MR moving the vmware file to disable the ci farm and I'll ack it
01:07sghuge: airlied: ACK!
01:18airlied: sghuge: should be good to resubmit the first one
01:20sghuge: airlied: Thanks! submitting first one now.
10:01MrCooper: "Not to bandwagon", he says, and proceeds to do it anyway
10:15FireBurn: airlied: It seems most of my emails directly to your gmail account are bouncing as spam
11:15airlied: FireBurn: uggh, maybe hit @redhat.com but if gmail bounces your hosting must be pissing it off somehow
13:50karolherbst: anybody ever looked into vectorizing shaders like this? We could executed all those ffmas as vec2: https://gist.githubusercontent.com/karolherbst/bf21aa5e177dd998bf4c396452cf4a7b/raw/22c97401cbb5373957e6c95dfd180e8181fb0f0d/gistfile1.txt
14:05karolherbst: Sooo.. I was considering how to increase compute performance, and I've seen e.g. vkpeak to create compute shaders with small workgorup sizes, e.g. 32. And the shader don't even rely on the workgorup_size. And I was wondering if for Vulkan we want to have a pass that converts a fixed workgroup_size to variable _if_ the shader does not access the
14:05karolherbst: workgroup_size from an API perspective (system values) and if we then want to enable vulkan drivers to dispatch with e.g. 256 invocations per workgroups instead if the API level dispatch allows for that (e.g. groupCounts have POT factors)
14:05karolherbst: and also wondering if that could cause other side-effects I'm not aware of
14:39glehmann: karolherbst: why would a larger workgroup size help?
14:39karolherbst: glehmann: higher occupancy
14:39glehmann: as long as the workgroup size is not smaller than the wave size, this doesn't matter at all on amd
14:39karolherbst: like for hiding memory latency and stuff
14:39karolherbst: on nvidia the hw switches to a different subgroup when it's waiting on memory e.g.
14:39karolherbst: and you can only have a limited amount of workgroups resident
14:40glehmann: weird af hardware design
14:40karolherbst: so if the workgroup is bigger, you can have more subgroups availalbe to switch to
14:40glehmann: well on amd we can have as many workgroups as subgroups
14:40karolherbst: on nvidia we don't :)
14:40karolherbst: it kinda only matters if the shader also uses few registers
14:41karolherbst: like 24 or so
14:41karolherbst: with a workgroup size of 32, 24 regs, I get a occupancy rate of 33%
14:41karolherbst: so dispatching 128 sized workgroups would be optimal
14:41karolherbst: and then I get 48 warps/SM instead of 16
14:41karolherbst: warps == subgroups
14:42karolherbst: not that it matters for vkpeak, because it doens't really do memory ops
14:42karolherbst: but in theory it does for more complex stuff
14:42simon-perretta-img: karolherbst: For your first point, isn't that what nir_opt_vectorize would do? We were eventually looking to do something like that since rogue can issue 2xf32 ops per instruction
14:42glehmann: but you can't always merge workgroups because of forward progress
14:42karolherbst: simon-perretta-img: well it can't vectorize this pattern
14:43karolherbst: glehmann: yeah.. I'm sure there are more restrictions besides "workgroup size is visible"
14:43simon-perretta-img: Oh I see...
14:43karolherbst: yeah.. the start and end of the chain is all scalar, so it doesn't really have a starting point
14:45karolherbst: glehmann: but I think for nv forward progress isn't an issue because the subgroups are independent and make progress anyway, even if the workgroup is bigger. I think even a workgroup barrier wouldn't be an issue
14:45karolherbst: but noooot quite sure there
14:46karolherbst: probably safer to not allow it when workgroup barriers are used
14:47karolherbst: but if the shader doesn't access the workgroup_id/size then I don't really see off hand where it could cause issues mhh...
14:47alyssa: glehmann: iirc on apple does quite some shenanigans to repack workgroups
14:47simon-perretta-img: I guess vec2s could be built from the sources to make opt_vectorize work but that'd be a lot of duplication... and I'm also guessing creating a new vec2 ffma2(a, b, c, x, y, z) op might be controversial lol
14:48alyssa: karolherbst: no, you can't do it with barriers
14:48karolherbst: alyssa: generally, I agree, but I think on nv it would be fine under certain conditions
14:48alyssa: workgroup_size fixed = 32
14:48glehmann: alyssa: I know that some hw has options to repack small workgroups into one subgroup, but 1:1 subgroup/workgroup being bad is just wild hw design
14:48alyssa: if (gl_GlobalInvocationID.x < 32) { barrier(); }
14:48alyssa: hang.
14:49karolherbst: alyssa: ohh... right, I was only thinking about increasing the size..
14:49alyssa: glehmann: for sure
14:49karolherbst: well it's not bad if you burn 128 regs :D
14:49alyssa: glehmann: apple has no hw for this, they dispatch a helper 1x1x1 compute kernel to repack before indirect dispatch (:
14:50karolherbst: incredible
14:50glehmann: fun
14:50karolherbst: _anyway_ sounds like there is some interest for having that also for asahi then?
14:50karolherbst: or is the hw already doing that?
14:50karolherbst: I mean..
14:50karolherbst: uhm..
14:50karolherbst: driver
14:51karolherbst: glehmann: I think the point is rather that having bigger workgroups is better than 1:1, because it allows for better memory latency hiding
14:52karolherbst: and 1:1 being optimal means you just lack the ability to hide memory through subgroup switching
14:52karolherbst: *memory latency
14:52karolherbst: _but_
14:52karolherbst: I guess you can also just have enough space in the hw to do both
14:52glehmann: yes, I know, but reasonable hardware just has as many workgroup slots as subgroup slots
14:52karolherbst: depending on how many subgroups you can have "available" on the hw level
14:53karolherbst: I mean.. we do have quite a bit of slots still
14:53karolherbst: 16 subgroups / MP/SM/whatever you call it
14:53karolherbst: ehh
14:54karolherbst: 48
14:54glehmann: amd can have 16 subgroups per SIMD, and each of those subgroups can come from a different workgroup
14:54karolherbst: it was 16 workgroups / SM
14:54karolherbst: but 48 subgroups on nvidia
14:54karolherbst: ampere
14:55karolherbst: yeah.. sounds like nvidia just allows more subgroups than AMD then
14:55glehmann: no
14:56glehmann: at least I think your SM is what amd calls compute unit
14:56glehmann: and a compute unit has multiple SIMD uints, so we have 32 subgroups max per "SM"
14:57karolherbst: mhhhh
14:57karolherbst: it's complicated.. but SM is kinda like the lowest level of processing unit
14:57karolherbst: above SMs are TPCs and we have some hardware that has 2 SMs per TPCs
14:58karolherbst: https://developer-blogs.nvidia.com/wp-content/uploads/2022/03/Full-H100-GPU-with-144-SMs.png
15:01karolherbst: anyway.. point is, for us it might help to have bigger workgroups
15:02karolherbst: it becomes a bit fuzzy if we can even do that if the shader uses shared memory.. but then it would also be beneficial to have bigger subgroups, but I think the amount of shaders we can safely increase the workgorup size and use shared memory is 0
15:03glehmann: according to techpowerup, H100 has 114 SMs and 14592 cuda cores. Meaning 128 cuda cores per SM. One rdna4 CU has 2 32 wide SIMD units, which also equals 128 if you double count thanks to dual issue
15:08karolherbst: simon-perretta-img: yeah... I don't really have a good idea how to write a general optimization to merge two fp16 ops without risking causing other issues...
17:55mareko: the first RDNA supported 40 subgroups per CU, but it was reduced to 32 shortly after
17:55mareko: having more subgroups isn't free, anything that's extra has a cost, and sometimes the cost isn't worth it
19:59mareko: alyssa: I wonder if you can ack this: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41681
20:00alyssa: don't have time to do a real review right now
20:11Kayden: mareko: that looks like a fun set of optimizations :) "the only defined value is [0]. so you get [0], because if you asked for anything else, it's undefined, so you get [0]" heheh
20:39mareko: glehmann: I wonder if you can ack this: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41681
21:21glehmann: I will try to find some time tomorrow