00:51 alyssa: hmm.. apple has a hw instruction for OpSubgroupShuffleUpINTEL ... it seems pretty useful for image processing
00:52 alyssa: wonder why that never got standardized. maybe it's not as useful as apple's docs claim? :P
00:55 alyssa: cmarcelo: I see https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7145 was closed..?
01:11 alyssa: otoh I can't find much evidence of anyone using the metal function so... shrug?
09:04 mlatus: Hi everyone, asking for help: I got the following error trying to test hardware acceleration with glmark2: `MESA: error: CreateSwapchainKHR failed with VK_ERROR_INITIALIZATION_FAILED MESA: error: zink: could not create swapchain`. I'm running in an unusual environment: an lxc container on an Android device where /dev/kgsl-3d0 and /dev/dma_heap are bind-mounted from host. The display is provided by Termux-X11. Mesa is compiled with freedreno-kmds=msm,kgsl.
09:05 mlatus: The GPU is Adreno 740. mesa-zink installed from termux-repository works. Is this an already known issue? How do I get more detailed logs? I tried to set MESA_LOG_LEVEL=debug but the error message above is all I got so far. Thanks!
12:53 Lynne: any tricks to convince glslang not to inline functions so obsessively?
14:15 alyssa: Lynne: bad news for you about gpu drivers
14:15 alyssa: :p
14:16 Lynne: au contraire, GPU driver compilers are very fast, sophisticated and highly optimized pieces of code that excel at runtime compilation
14:17 Lynne: ...then there's glslang, the second worst piece of open source software
14:19 Lynne: https://github.com/cyanreg/FFmpeg/blob/vulkan/libavcodec/vulkan/ffv1_enc.comp#L24-L89
14:19 Lynne: this tiny piece of code is enough to take 60 seconds to compile, IF it compiles at all
14:21 Lynne: I can't explain why, put_rac is 20 lines of code of barely a few dozen adds and mults
14:23 Lynne: if I add just one more put_symbol(), glslang errors out with an unknown error, just gives up
14:24 MrCooper: -EGONESHOPPING
14:25 glehmann: Lynne: do you use the glslang optimization options? I thought by default it barely does anything and the output spirv is really close to the source
14:25 Lynne: nope, everything's disabled
14:30 Lynne: even pre-baking all function parameters still makes it choke
15:33 cmarcelo: alyssa: I can revive if there's any interest. i kind of liked it having the "overflow" extra data.
15:40 cmarcelo: alyssa: note the shuffle themselves were implemented. that extension has a bunch of other things.
15:43 DemiMarie: mlatus: does Mesa support the kernel driver you are using?
15:48 alyssa: cmarcelo: yeah.. I realized this morning shuffle-and-fill should be equivalent to bcsel-and-rotate
15:48 alyssa: so even if we don't end up vk_ext'ing it, there might be an isel opportunity
15:49 alyssa: apple docs suggest that this operation forms the building block for efficient convolutions, and in fact m2+ has a single instruction (!) for "quad shuffle and fill, then fma with that result"
15:50 alyssa: but if amd/nvidia don't benefit I can understand it not getting standardized :P
15:50 cmarcelo: alyssa: is there a URL reference to the metal one?
15:54 alyssa: cmarcelo: metal docs are https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf search for shuffle_and_fill but https://developer.apple.com/videos/play/tech-talks/10876/ is honestly more useful for understanding why this thing exists (last 1/3 of the video)
15:54 alyssa: https://developer.apple.com/videos/play/tech-talks/10876/?time=1067 rather
15:57 cmarcelo: alyssa: tks. will take a look. for intel, we can also do a trick with two consecutive regs and regioning depending on the case. would it be useful to you if we made an NIR intrinsic for this? (right now we lower right at spirvtonir)
16:00 alyssa: ye
16:03 cmarcelo: alyssa: ok, will ping you when I have something
16:03 alyssa: cool :}
16:03 alyssa: then the harder problem, getting devs to use it :P
16:04 cmarcelo: yeah, that's another step
16:04 cmarcelo: now will effectively move the lowering from spirv into the relevant nir pass.
16:04 alyssa: good news is that it can be lowered easily to bcsel&rotate so we can flip on the hypothetical ext for every vk driver in mesa all at once :P
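The bcsel-and-rotate lowering mentioned above can be sketched on the CPU by modelling a subgroup as a Python list, one element per lane. This is a minimal sketch under an assumed reading of the shuffle-up-and-fill semantics: lane i reads `data` from lane i - delta, and lanes that would underflow read `fill` from lane i - delta + N instead. All function names here are hypothetical illustrations, not NIR, SPIR-V, or Metal APIs, and delta is assumed to lie in 0..N-1.

```python
def shuffle_up_and_fill(data, fill, delta):
    """Reference semantics (assumed): lane i takes data[i - delta],
    or fill[i - delta + n] when i - delta would be negative."""
    n = len(data)
    return [data[i - delta] if i >= delta else fill[i - delta + n]
            for i in range(n)]


def rotate_up(values, delta):
    """Subgroup rotate: lane i reads lane (i - delta) mod n."""
    n = len(values)
    return [values[(i - delta) % n] for i in range(n)]


def lowered_shuffle_up_and_fill(data, fill, delta):
    """Lowering sketch: per-lane select (bcsel) followed by one rotate.

    The top `delta` lanes are the ones whose values wrap around to the
    bottom after the rotate, so those lanes contribute `fill` and all
    other lanes contribute `data`.
    """
    n = len(data)
    selected = [fill[j] if j >= n - delta else data[j] for j in range(n)]
    return rotate_up(selected, delta)


if __name__ == "__main__":
    data = list(range(8))            # hypothetical 8-lane subgroup
    fill = [100 + x for x in data]   # values from the "previous" row
    for delta in range(8):
        assert (shuffle_up_and_fill(data, fill, delta)
                == lowered_shuffle_up_and_fill(data, fill, delta))
    print(lowered_shuffle_up_and_fill(data, fill, 2))
```

The select happens on the source lane, before the rotate, which is why a single rotate suffices; selecting after the shuffle would need both inputs rotated.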
16:43 cmarcelo: alyssa: re: video. unless I'm missing something, the _fill part helps a little bit + ergonomics but seems the winner there is the fact you can shuffle stuff no? (i.e. without the fill you'll still be able to do the algorithm, just using two inst instead of one)
16:45 cmarcelo: (the "win" being less samples being taken)
17:09 alyssa: cmarcelo: yeah, for sure
17:09 alyssa: which is why it would make sense to expose in vk/cl even on hardware that needs it lowered
17:10 alyssa: I have no idea why Apple doesn't do so. Force people to buy new hardware, maybe :<
18:47 bluetail: alyssa with the option to downgrade iOS you get back performance. Newer iOS is usually demanding more. Same for macOS.
23:14 Ermine: I've got a feeling that dotclock can be computed using other parameters of a given mode. Am I wrong?
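For reference, the usual relationship behind this question can be written down as a short sketch. It assumes a progressive (non-interlaced, non-doublescan) mode, where the pixel clock is the total number of pixels per frame, blanking included, times the refresh rate. The function name and parameters are hypothetical, not a DRM API. Note that the refresh rate is itself often derived from the clock, so the timings alone (without a refresh rate) do not determine the dotclock.

```python
def dotclock_khz(htotal, vtotal, vrefresh_hz):
    """Pixel clock in kHz for a progressive mode (assumed formula):
    clock = htotal * vtotal * vrefresh, with totals including blanking."""
    return htotal * vtotal * vrefresh_hz / 1000


if __name__ == "__main__":
    # Common 1920x1080 @ 60 Hz timing: htotal = 2200, vtotal = 1125.
    print(dotclock_khz(2200, 1125, 60))
```

Interlaced or doublescan modes scale the effective line count, so they would need an extra factor that this sketch leaves out.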