14:26jenatali: karolherbst: I think I just figured out a way to do BDA in dzn, the exact same way that CLOn12 emulates pointers (idx+offset pairs), but instead of index being a locally bound array index, just make it a global resource ID which we already have for descriptor_indexing
14:27jenatali: I don't think I would've come up with that if you hadn't suggested rusticl on zink on dzn, so thanks for that
15:03karolherbst: jenatali: cool. But would that also work with arbitrary pointers?
15:04karolherbst: though _might_ be fine
15:04karolherbst: or at least for most things
15:05karolherbst: jenatali: I think the only problem you'd need to solve is to keep the idx valid across kernel invocations for things like global variables or funky stuff kernels might do
15:05jenatali: What do you mean arbitrary pointers? When the app asks for a buffer address, I just give back an index and offset
15:05jenatali: Yeah, that's what I mean by a global index
15:05karolherbst: sure, but applications can do random C nonsense
15:05karolherbst: right..
15:05karolherbst: yeah.. then it should be fine
15:06jenatali: Yeah, it won't be stable for capture/replay but that's a different feature so that's fine
15:06karolherbst: so set_global_bindings would return an index and offset packed into 64 bits, I pass this into the kernel via ubo0 (kernel arguments) and then it should be good to go
15:06jenatali: Right
15:07karolherbst: and gallium doesn't use load_global(_constant) and store_global for anything, so you can deal with the madness there
15:07karolherbst: I wonder if I want to support different pointer layouts directly, but....
15:08jenatali: Well I don't have that bindless path in the gallium driver currently, only in dozen
15:08karolherbst: the CL path is really special sadly
15:08karolherbst: we have this `set_global_bindings` api which is a bit funky...
15:08karolherbst: but that's everything you'd need
15:08jenatali: Yeah makes sense
15:09karolherbst: luckily there are no bindless images or anything
15:09karolherbst: and `set_global_bindings` basically means: give me the GPU address for those pipe resources, and make them available for compute dispatches
15:10karolherbst: there is also some funky offset business going on, but iris/radeonsi/zink have it correctly implemented
15:11karolherbst: jenatali: uhm.. there is another thing: `pipe_grid_info::variable_shared_mem`, no idea if you can support that
15:12karolherbst: how are CL local memory kernel parameters currently implemented on your side?
15:12jenatali: Only by recompiling shaders
15:12karolherbst: mhhh
15:12jenatali: Same with local group size because that's a compile-time param in D3D
15:13karolherbst: I see, so you have to deal with pain like that already anyway
15:13jenatali: Yeah
15:14karolherbst: kinda sucks, but not much you or I could do about it...
15:15jenatali: karolherbst: btw, I noticed you're computing a dynamic local size by using gcd() with the SIMD (wave) size and the global size. That's always going to return 2 for even global sizes and 1 for odd, since SIMD sizes are powers of 2
15:16jenatali: I was looking because CLOn12's handling of odd global dimensions was... Bad
15:16karolherbst: yeah...
15:16karolherbst: I reworked that code tho, just never landed it as it was part of non uniform workgroup support
15:16jenatali: Cool
15:16karolherbst: it doesn't matter anyway as most applications aren't silly enough to run into this edge case
15:17karolherbst: can you support non uniform work groups?
15:17karolherbst: if so.. doesn't matter long term anyway
15:17jenatali: Not natively
15:18karolherbst: mhhh
15:18jenatali: karolherbst: apparently Photoshop does
15:18karolherbst: figures...
15:18jenatali: At least that's what one of our teams is telling me
15:18karolherbst: yeah.. it makes perfect sense if they use image sizes for stuff
15:20karolherbst: but uhhh.. why do you think I'm using the simd size with gcd?, I'm using the thread count and the grid size
15:20karolherbst: subgroups only as a last resort if things align really terribly
15:20karolherbst: *SIMD size
15:21karolherbst: `optimize_local_size` is what I'm looking at
15:23karolherbst: so if you have 512 threads and a grid of 500x1x1, you'd get 500x1x1 still
15:24karolherbst: it just has some weirdo edge cases where it uses terrible local sizes
15:24karolherbst: I don't like the third part of that function and it could be better, but it's not _as_ bad
15:30jenatali: Hmm ok, I thought I saw SIMD size in there
15:30jenatali: The gcd is still always going to be 2 or 1 though, since that thread count will also be a power of 2
15:44karolherbst: it can be any pot number
15:44karolherbst: if your gpu supports 1024 threads, you have 2^10 on one side, and anything else on the other one
15:44jenatali: ... Yeah that's what I meant
15:44jenatali: A power of 2 or 1
15:44karolherbst: ahh yeah, fair
15:45karolherbst: the last block is supposed to fill it up if the middle one couldn't find a pot of a SIMD size or bigger
15:46karolherbst: so if the loop manages to set local to the SIMD size, fine, nothing else to do. I just wanted to prevent sub optimal distribution of threads
15:46karolherbst: _however_
15:46karolherbst: threads doesn't have to be pot
15:46karolherbst: intel is kinda weird there...
15:47karolherbst: jenatali: https://github.com/KhronosGroup/OpenCL-CTS/issues/1716
15:48karolherbst: there are some intel extensions to make better use of it, and I also kinda have to take that into account
15:48jenatali: Fun
15:48karolherbst: but I also kinda wanted to finish non uniform first
15:48karolherbst: the intel extension e.g. allows you to set the subgroup size
15:49karolherbst: but yeah.. that part of the code has a big TODO to take all of that into account