06:43glehmann: I kind of think that's llama.cpp's fault then
06:44glehmann: we could start ignoring the loop unroll hints, but that's also sad
06:52glehmann: at some point, the block count is not just an issue for compile times, but also for instruction cache
07:04airlied: glehmann: I think we could do a better job on this shader though
07:04airlied: also for these shaders icache is not the problem, pure mem bw bound
10:04Mary: Does anyone know what that mean https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27196#note_3499899 I'm a bit confused
10:15karolherbst: mareko: So I was looking into pipe_grid_info::last_block again, and I've noticed that on radeonsi the "workgroup_size" intrinsic returns the "full" size in the workgroups on the edges. Is this expected behavior? And is there a way to get the partial sizes on the edges instead?
10:21karolherbst: mhh it looks like AMD hw just doesn't have a hw mechanism for fetching that value? that would be annoying
10:51Mary: Seems that Marge is happy again with my MR
11:18daniels: Mary: it's a rather opaque way of saying that you assigned the MR to Marge, it failed, you pulled Marge's changes (with Part-of included) and pushed on a clean rebase of main; usually Marge does a rebase of your branch on main and pushes the result, which triggers CI automatically, but in this case as the rebase was a no-op, Marge had nothing to push, so all the CI jobs were set to manual as it was pushed by you rather than Marge
11:19daniels: to avoid that in future, if you're pushing stuff yourself, don't include the Part-of: trailer that marge adds
11:26mahkoh: Does anyone know from the top of their head what could cause this issue on an RDNA2 iGPU: https://files.catbox.moe/5bvnqk.png
11:26mahkoh: Concretely: I'm using RADV_EXPERIMENTAL=transfer queue and transfer a buffer to a DCC compressed image.
11:27mahkoh: If I hardcode the "compressed write" parameter in ac_cmdbuf_sdma.c to false, the issue goes away.
11:27mahkoh: Same if I disable the affected modifier.
11:29mahkoh: I'm copying the entie image but the artifacts only appear in those areas that have damage compared to the previous contents of the DCC image.
11:29mahkoh: If I repeat the same copy again later, the artifacts go away.
11:29Mary: daniels: I see so clean up the Part-of, noted thanks!
11:29mahkoh: I was thinking that this could be a sync issue, but so far no amount of deviceWaitIdle and sleeping has fixed it.
11:30mahkoh: Could this also be some kind of sync issue related to CPU caches? I have not been able to reproduce this on my RDNA2 dGPU.
11:31mahkoh: In this case, the source buffer is a CPU-resident buffer written by the dGPU from which the iGPU then copies to the DCC-compressed image.
11:34mahkoh: The modifier in question is 0x020000000056bb03 = AMD_GFX10_RBPLUS,64KB_R_X,PIPE_XOR_BITS=2,PACKERS=0,DCC,DCC_MAX_COMPRESSED_BLOCK=128B,DCC_INDEPENDENT_128B,DCC_CONSTANT_ENCODE,DCC_PIPE_ALIGN. The iGPU supports no other modifier with DCC.
12:25karolherbst: jenatali or maybe alyssa: wanna review an MR fixing cursed CLC edge cases? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41726
13:00alyssa: karolherbst: "want" is a strong worde
14:05mareko: karolherbst: the hw doesn't have that mechanism, it's the same workgroup size but invocations are disabled as if it executed in a conditional block
14:19karolherbst: mareko: right.. I'm just wondering how to emulate this then. I could lower it and try to identify if I'm at the edge, but the most reliable way for this is to compare the current workgroup size with the enqueued one. And not having that kinda sucks. Is there a way to identify how many invocations are active or something like that? Could run
14:19karolherbst: something in the first block and store the value.
14:26mareko: maybe ballot
14:27mareko: bitcount(ballot)
14:29mareko: or put "last_block" into a user SGPR, and do: for (i = 0..2) if (workgroup_id[i] == grid_size[i] - 1) workgroup_size[i] = last_block[i];
14:35mareko: if it's that complicated, it's not worth supporting
15:25karolherbst: mhhh yeah.. I have emulation code for non-uniform workgroups that doesn't rely on changes inside the shaders by just splitting up the launches. Might play around with the idea, but it's going to be better than running kernels with like 5 invocations per workgroup or something funny just to make it all uniform :')
15:27mareko: the purpose of last_block is not having to have the conditional block that skips last block invocations
15:27karolherbst: could make the compute_grid_info_last_block cap require "sane" behavior for workgroup_size...
15:28karolherbst: otherwise it's of little use for frontends
15:28karolherbst: could use last_block if the workgroup_size isn't used tho...
15:29mareko: it was never meant to be used by frontends
15:29karolherbst: well currently nothing uses it
15:29karolherbst: the cap I mean
15:29mareko: and the feature won't be extended beyond what's implemented
15:29karolherbst: yeah, that's fine
15:30mareko: it should reduce the shader binary size, not increase it
15:30karolherbst: I think it might make sense to split the cap into two that mean "supports last_block" and "workgroup_size is last_block in last_block
15:31karolherbst: and then I have all that I need. My emulation code doesn't increase shader code anyway, but I could reduce the amount of launch_grid calls using last_block
15:32karolherbst: and if I can only use last_block on radeonsi for shaders not using workgroup_size that would still be better
15:33mareko: extra loads to get workgroup_size right at the beginning of shaders would reduce perf of small shaders by 10-20%
15:35karolherbst: ohh that's not what I mean, I mean if the applciation code uses workgroup_size, I couldn't use last_block on radeonsi
15:35karolherbst: but if it doesn't, then it would be fine
16:59karolherbst: alyssa: don't worry, it gets worse: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/41970