00:43Hazematman: Has anyone looked at ASTC support for lavapipe? I thought about doing something similar to anv where there is a hidden emulation plane, but I'm running into some weird issues. If I allocate the resource from llvmpipe as an ASTC format I run into issues with size-compatible format tests in CTS, and if I allocate the resource as an RGBA32 format I run into issues with the mipmap logic not working
00:43Hazematman: Trying to understand the llvmpipe JIT code and how it handles formats where the resource block size and the view format block size don't match
00:51airlied: Hazematman: probably want to allocate two resources, or just decode on initial use into an RGBA8 or RGBA32 format?
00:52airlied: I can't see why llvmpipe itself would ever see ASTC
00:52airlied: but maybe llvmpipe_texture_layout needs some adjusting
00:54Hazematman: airlied: My approach right now is allocating two textures; the RGBA8 one works fine and holds the decoded version of the ASTC data. Getting the ASTC-format texture to behave is where I'm running into fun corner cases :P
00:58airlied: the texture layout code should handle blocks fine, but I've no idea how ASTC gets laid out
01:00Hazematman: I think the layout code is fine. I suspect it's the layout being different between RGBA32 and ASTC that's causing issues in the compatible-format tests in Vulkan, where an RGBA32 view of an ASTC texture is created
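As a rough illustration of the two-texture approach being discussed, here is a sketch against gallium's pipe_resource/pipe_screen interfaces; the struct and helper are invented for illustration and are not lavapipe's actual code:

    /* Sketch only: keep the raw ASTC blocks in one resource and shadow them
     * with an uncompressed RGBA8 resource holding the decoded texels.
     * Assumes gallium's pipe/p_state.h and pipe/p_screen.h. */
    struct emulated_astc_image {
       struct pipe_resource *compressed;  /* raw ASTC blocks, never sampled */
       struct pipe_resource *decoded;     /* RGBA8 copy used for sampling */
       bool decoded_valid;                /* re-decode when the data changes */
    };

    static struct pipe_resource *
    create_decoded_shadow(struct pipe_screen *screen,
                          const struct pipe_resource *astc_templ)
    {
       struct pipe_resource templ = *astc_templ;
       /* Same extent in texels, but an uncompressed format, so mip and
        * layout math is done per pixel rather than per 4x4..12x12 block. */
       templ.format = PIPE_FORMAT_R8G8B8A8_UNORM;
       return screen->resource_create(screen, &templ);
    }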
01:10mareko: mahkoh: it depends on the access pattern; if a shader subgroup reads one 64x1 block from a linear texture, it's optimal; if it reads one 8x8 block, it's inefficient because it might end up reading 16x8 pixels (1 cache line * 8) from the linear texture in order to get 8x8 pixels
01:12mareko: fragment shader subgroups are always tiled, so that will be inefficient, but you can use compute shaders and make the workgroups 64x1 to make it efficient
01:14mareko: radeonsi (amd/common) has a compute shader blit path that chooses the workgroup and overall access pattern based on the image layout in memory to make it always optimal
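A minimal sketch of the workgroup-shape choice mareko describes; the helper and its linear/tiled flag are hypothetical, not radeonsi's actual interface in amd/common:

    #include <stdbool.h>

    /* Pick a compute-blit workgroup shape that matches how the source image
     * is laid out in memory (illustrative only). */
    static void
    choose_blit_workgroup(bool src_is_linear, unsigned *wg_x, unsigned *wg_y)
    {
       if (src_is_linear) {
          /* Linear rows: a 64x1 workgroup walks consecutive texels, so every
           * cache line it pulls in is fully used. */
          *wg_x = 64;
          *wg_y = 1;
       } else {
          /* Tiled layouts keep 2D neighbourhoods close together in memory,
           * so a square footprint wastes little bandwidth. */
          *wg_x = 8;
          *wg_y = 8;
       }
    }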
01:14mahkoh: For simplicity we can assume that either the entire image is copied or one rectangular region is copied.
01:14mahkoh: Usually closer to a square than a line.
01:18mahkoh: I guess the worst case would be a vertical line with a 15x overhead. How much does this actually matter on iGPUs?
01:20mareko: the type of GPU doesn't matter
01:21mareko: a fragment shader isn't going to be very fast, but compute shaders can be; or use a blit function in the API and the driver will try to choose the optimal path
01:23mahkoh: It would be complicated because in general I have to blend many layers and only some of them are linear. So I would either have to use compute shaders for everything or change the layout repeatedly.
01:23mahkoh: Sometimes I also have to apply scaling to the source image.
01:30mahkoh: If the memory fetch is the only overhead, then I think that is probably fine. For small regions where the overhead is relatively large, it will be fast enough because the overhead is on the order of copying 15 additional pixels per row. And for large regions the relative overhead is low.
01:32mahkoh: Am I understanding correctly that for a full image copy there would be no overhead?
01:55mareko: the overhead is for every shader subgroup because every subgroup reads 8x8 pixels (or similar area) separately, so the inefficiency doesn't change with the size of the blit
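A back-of-the-envelope version of that read amplification, assuming RGBA8 (4 bytes per pixel) and 64-byte cache lines; both numbers are assumptions, chosen because they reproduce the 16x8-pixels-for-8x8 figure quoted earlier:

    #include <stdio.h>

    int main(void)
    {
       const int cache_line = 64;          /* bytes per cache line (assumed) */
       const int bpp = 4;                  /* RGBA8: bytes per pixel (assumed) */
       const int block_w = 8, block_h = 8; /* area one subgroup reads */

       /* From a linear image each of the 8 rows pulls in a full cache line,
        * i.e. 16 pixels, even though only 8 of them are needed. */
       int pixels_per_line = cache_line / bpp;    /* 16 */
       int bytes_fetched = block_h * cache_line;  /* 8 * 64 = 512 */
       int bytes_used = block_w * block_h * bpp;  /* 8 * 8 * 4 = 256 */

       printf("each row touches a %d-pixel cache line; "
              "%d bytes fetched for %d bytes used (%.1fx)\n",
              pixels_per_line, bytes_fetched, bytes_used,
              (double)bytes_fetched / bytes_used);
       return 0;
    }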
02:42mahkoh: Couldn't the driver choose the subgroup dimensions to better match the memory layout of the source?
02:43mahkoh: E.g. 16x4
02:44mahkoh: Assuming that the dimensions can be chosen independently for each vkCmdDraw invocation so as to not deoptimize copies from tiled textures.
07:17mareko: mahkoh: subgroup dimensions can't be chosen for fragment shaders; the hw packs any quads (2x2 groups) that should execute the FS into the same subgroup automatically, and usually the quads are also horizontally or vertically adjacent
07:19mahkoh: Do GPUs have L1, L2 caches and such? If so, how much overhead is it really if the next subgroup actually needs that memory?
07:22mareko: if you can fit everything into L2, it might not be that bad
07:23mahkoh: How much would that be on current generation iGPUs?
07:28mripard: lumag: ping for https://lore.kernel.org/all/Zbou-y7eNhQTMpKo@phenom.ffwll.local/
07:33mareko: mahkoh: the L2 cache is between 256KB and 2MB on iGPUs
08:21mahkoh: That is less than I would have thought.
08:24mahkoh: Thanks for explaining these things.
14:55zmike: pepp: re: that issue you raised, I'm stuck on a different project for a week or two, but I can review if you want to try a fix
16:18pepp: zmike: sure, I'll try. I'm not sure if all callers of pipe_resource_release might cause the same problem or not
16:34zmike: pepp: I think that should be the only one? or at least there can't be very many cases where glthread itself is deleting resources
16:34zmike: and thanks!
17:54mareko: I guess that will help with the Creo crashes