00:40DemiMarie: karolherbst: Is it possible to use P2PDMA to avoid this?
01:05airlied: DemiMarie: who's the 2nd P?
01:27DemiMarie: airlied: perhaps a VF
01:27DemiMarie: It would be unfortunate if there was no better way to get data from a VF to a PF than through host memory
01:28DemiMarie: And considering the primary use-case for iGPU SR-IOV, I would also consider this quite surprising.
01:34airlied: sounds more like a buffer sharing problem than a p2p problem
08:40haasn: karolherbst: the problem with av1 on gpu is that we want to do almost exclusively 8/16 bit integer ops
08:40haasn: In my current code I just store everything as an int
08:40haasn: But that’s wasteful, 75% of the ALU is unused
08:41haasn: I was thinking that using u8vec4 would let the gpu pack four 8 bit ints into a single register
08:41haasn: Basically upping throughput from 32 pixels / warp / cycle to 128
08:42haasn: But doing something like a convolution with this design is very hard
08:43haasn: Can’t just do your usual subgroup sum etc
08:43haasn: And you need to widen it to 16 bit intermediates
08:45HdkR: Also most GPUs these days don't allow you to do much SWAR anymore
09:33karolherbst: haasn: most GPUs don't have any benefit doing 8/16 bit alu over 32 bit, unless you can use those fancy AI matrix multiplication ops
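A rough sketch of the packing idea discussed above, written in Python rather than GLSL so it can be run anywhere. It is not taken from haasn's shaders; all function names here are hypothetical. It shows a lane-wise wrapping add on four 8-bit values packed into one 32-bit word, and the widening to 16-bit lanes that a convolution would need for its intermediates.

```python
MASK32 = 0xFFFFFFFF
HIGH_BITS = 0x80808080  # top bit of each 8-bit lane
LOW_BITS = 0x7F7F7F7F   # low seven bits of each 8-bit lane


def pack_u8x4(a, b, c, d):
    """Pack four 0..255 values into one 32-bit word, lane 0 in the low byte."""
    return (a & 0xFF) | ((b & 0xFF) << 8) | ((c & 0xFF) << 16) | ((d & 0xFF) << 24)


def unpack_u8x4(word):
    """Return the four 8-bit lanes of a 32-bit word as a tuple."""
    return tuple((word >> (8 * i)) & 0xFF for i in range(4))


def add_u8x4_wrapping(x, y):
    """Lane-wise add modulo 256 without carries leaking into the next lane."""
    # Adding only the low seven bits keeps every carry inside its own lane.
    low = (x & LOW_BITS) + (y & LOW_BITS)
    # The top bit of each lane is then fixed up with an XOR (add without carry-out).
    return (low ^ ((x ^ y) & HIGH_BITS)) & MASK32


def widen_to_u16x2(word):
    """Split one u8x4 word into two words that each hold two 16-bit lanes."""
    lo = (word & 0xFF) | (((word >> 8) & 0xFF) << 16)
    hi = ((word >> 16) & 0xFF) | (((word >> 24) & 0xFF) << 16)
    return lo, hi
```

After widening, each word carries only two values, so half of the packing gain is gone for every step that needs 16-bit intermediates.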
09:55Company: is there a technical reason why amd and nvidia don't implement GL_EXT_shader_framebuffer_fetch ?
09:56Company: I'm trying to figure out the best way to do HDR colorspace conversions without needing an extra buffer
09:58Company: ie I have an srgb buffer and want to convert it to rec2020, or sth along those lines
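For reference, a minimal sketch of the per-pixel math such a conversion involves, assuming the source is sRGB-encoded with BT.709 primaries and the target is linear light with Rec. 2020 primaries. The matrix uses the commonly quoted rounded coefficients (as in ITU-R BT.2087). The function names are hypothetical, and the output transfer function (PQ, HLG, ...) is deliberately left out.

```python
def srgb_to_linear(v):
    """Inverse sRGB transfer function for one channel value in [0, 1]."""
    if v <= 0.04045:
        return v / 12.92
    return ((v + 0.055) / 1.055) ** 2.4


# Linear BT.709 RGB -> linear BT.2020 RGB, rounded to four decimals.
BT709_TO_BT2020 = (
    (0.6274, 0.3293, 0.0433),
    (0.0691, 0.9195, 0.0114),
    (0.0164, 0.0880, 0.8956),
)


def srgb_to_linear_bt2020(rgb):
    """Convert one sRGB-encoded (r, g, b) triple to linear Rec. 2020 RGB."""
    lin = [srgb_to_linear(c) for c in rgb]
    return tuple(sum(m * c for m, c in zip(row, lin)) for row in BT709_TO_BT2020)
```

In a fragment or compute shader the same steps apply per pixel. The question in the discussion is only where the source value is read from and whether the result can be written back to the same buffer.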
10:08dj-death: compute shader + storage image?
10:10Company: then I need to add compute support - which is gonna happen long term I guess
10:11karolherbst: can also use storage image in fp programs I think
10:11karolherbst: or not?
10:11Company: i have no idea - it's probably a driver question too
10:12Company: and/or a version question, because we use GLES these days
10:14Company: I'm just trying to find the smartest way to do this, so HDR can work on the worst possible hardware without lags
10:15Company: with the least amount of work
10:15pq: GL_EXT_shader_framebuffer_fetch_non_coherent any better?
10:16karolherbst: nvidia hardware doesn't support fbfetch natively anyway, and in nouveau we simply bind the framebuffer as a texture to read from
10:16Company: pq: same drivers
10:16Company: more or less
10:18pq: those that don't support fbfetch, do they also have problems with using a temporary buffer?
10:19karolherbst: it's kepler+ in nouveau anyway
10:20karolherbst: I doubt that reading from the framebuffer is efficient on any hardware
10:20karolherbst: I might be wrong
10:20karolherbst: maybe tilers do better here
10:20pq: they have to blend somehow...
10:22Company: I have no idea - the naive solution is to use an extra temporary buffer and that's always gonna work
10:22pq: it's just blending, except one would want to add custom code to mangle the read and written values of the destination - especially the read values
10:22Company: but it means there's an extra buffer involved
10:23pq: I guess it's a problem if blending happens by fixed-function.
10:23Company: I also haven't looked yet at how this works in Vulkan and if/how I can have an image be input and output attachment at the same time
10:24karolherbst: maybe just use an image as in read-write image?
10:26dj-death: karolherbst: we can read the render target cache on Intel, but it doesn't support MSAA very well
10:27dj-death: karolherbst: you have to do non-uniform lowering to fetch each sample
10:27karolherbst: uhh
10:27dj-death: so we've been doing the same as nouveau
10:27dj-death: use the sampler
10:27karolherbst: yeah, I think that's the sanest solution if you only want to have it supported :D
10:27dj-death: but that means you have to be careful with compression
10:30Company: oh, I didn't think about that yet
10:31Company: we've been thinking about using msaa or supersampling for higher quality output, and that obviously interacts
10:32Company: but that's a long way off - first inkscape needs a renderer using gtk instead of cairo
11:19haasn: Here is the source for the shaders if anybody is interested, I couldn’t figure out how to optimize them further https://code.videolan.org/haasn/dav1d/-/tree/gpu-wip/src/glsl?ref_type=heads
11:30FireBurn: Is the AMD 7900M in any laptops? There was exclusivity with Alienware but it looks like they've stopped selling them now
12:37dliviu: mlankhorst: I see that my patch got removed from drm-misc-next-fixes, am I correct that it hasn't been added anywhere else? Can I push it into drm-misc-next?
12:38mripard: dliviu: it's in drm-misc-fixes
12:39dliviu: mripard: thanks
12:45karolherbst: haasn: maybe instead of looping inside looprestoration.glsl, you could see if having one thread do one iteration is a feasible rework here and see how much of a difference that would make
12:46karolherbst: or at least if you have loops, make sure that threads don't diverge (e.g. having the same iteration across all threads)
12:46haasn: karolherbst: that loop primarily exists because the input size exceeds the output size
12:47haasn: I map threads onto output pixels in a 1:1 manner
12:47haasn: could also try the alternative of mapping threads onto input pixels and then dealing with the fact that some invocations will be idle during the last convolution
12:48karolherbst: yeah, I think that might be fine
12:48mripard: dliviu: it was meant for drm-misc-next?
12:48karolherbst: what matters is, that threads don't diverge in control flow. Like if they all execute loops in lock-step that's fine
12:48dliviu: I was not in a rush
12:48karolherbst: or rather
12:48karolherbst: all threads within a subgroup
12:49karolherbst: if entire subgroups diverge, that's still fine
12:49dliviu: mripard: it doesn't "fix" anything other than adding more compilation exposure to more functions in komeda
12:49karolherbst: e.g. have a block size of the subgroup size, and make sure that each block executes the same path through the shader
12:50karolherbst: and then if entire blocks have nothing to do at the end, they can exit early without having to wait for other threads
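A toy cost model of the point karolherbst makes about divergence, assuming a simple lock-step execution model in which a subgroup runs every path that at least one of its lanes takes. Real hardware is more nuanced. The function names and the cost numbers in the usage note are made up for illustration.

```python
def subgroup_cost(lane_takes_a, cost_a, cost_b):
    """Cost of one subgroup: each path taken by any lane is paid once by the whole subgroup."""
    cost = 0
    if any(lane_takes_a):        # at least one lane needs path A
        cost += cost_a
    if not all(lane_takes_a):    # at least one lane needs path B
        cost += cost_b
    return cost


def dispatch_cost(conds, subgroup_size, cost_a, cost_b):
    """Sum the cost over all subgroups, where conds[i] says whether lane i takes path A."""
    total = 0
    for start in range(0, len(conds), subgroup_size):
        total += subgroup_cost(conds[start:start + subgroup_size], cost_a, cost_b)
    return total
```

With 64 lanes, a subgroup size of 32, and costs of 10 and 4, an interleaved pattern (`[True, False] * 32`) makes both subgroups pay for both paths, while a sorted pattern (`[True] * 32 + [False] * 32`) lets each subgroup pay for only one. Divergence between whole subgroups costs nothing extra in this model.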
12:54haasn: the internal block size of this algorithm is 8x8 so I set my WG size to that
12:55karolherbst: yeah, should be fine, although there is hardware with bigger subgroup sizes and I don't know how much that matters there
12:55haasn: there will be some partial executions during the initial "assemble input" phase, that's almost unavoidable because the input pixel count is not a clean multiple of 32
12:55karolherbst: though they might be able to execute two blocks at once?
12:56haasn: AMD used to use 64 sized subgroups, I don't think there are more than that?
12:56karolherbst: broadcom has 128
12:56karolherbst: apparently
12:57karolherbst: and I can see people using rpis as their media center thing, so it might even matter, but honestly don't know if the hw has some smartness here to counteract this
12:57haasn: something I just randomly thought of, for film grain it might be possible to split it into two passes, one to cover all non-edge pixels and a second, separate pass to deal only with edge pixels (that need to be blurred with the previously processed non-edge pixels)
12:58karolherbst: mhh, yeah, might be worth a shot
12:58haasn: the film grain shader diverges quite badly
13:00karolherbst: on some GPUs (nvidia e.g.) having more specialized programs allocating fewer GPRs also leads to being able to run more threads in parallel
13:00haasn: right
13:00haasn: actually, that might be worth doing for loop restoration as well
13:01haasn: instead of if (cond) { path A } else { path B }; invoke two shaders: if (!cond) return; /* path A */ and: if (cond) return; /* path B */
13:02haasn: on the premise that inactive workgroups will very quickly return out, freeing up those resources for more groups
13:02karolherbst: yeah, but sometimes you have to check if some trade-offs are worth doing. It might also make sense to group/sort input and reshuffle data so that blocks stay uniform
13:03haasn: the condition here is on a scalar value shared by the entire work group
13:03haasn: the condition is granular on an 8x8 level
13:03haasn: (that's why the workgroup size is set to 8x8)
13:03karolherbst: right, in which case it should be fine
13:03karolherbst: just overallocating gprs might be a problem then if there is an imbalance between those two paths
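The two-dispatch idea haasn describes can be sketched on the CPU as follows, assuming the condition is uniform per 8x8 block as stated above. `path_a` and `path_b` are hypothetical stand-ins for the real shader paths, and each list element plays the role of one workgroup.

```python
def path_a(pixel):
    """Hypothetical stand-in for the work done on path A."""
    return pixel * 2


def path_b(pixel):
    """Hypothetical stand-in for the work done on path B."""
    return pixel + 1


def single_pass(blocks, conds):
    """One 'shader' that branches per block: if (cond) { path A } else { path B }."""
    return [[path_a(p) if c else path_b(p) for p in block]
            for block, c in zip(blocks, conds)]


def two_pass(blocks, conds):
    """Two specialized 'shaders', each returning early for blocks it does not handle."""
    out = [list(block) for block in blocks]
    for i, (block, c) in enumerate(zip(blocks, conds)):
        if not c:
            continue  # first dispatch: if (!cond) return;
        out[i] = [path_a(p) for p in block]
    for i, (block, c) in enumerate(zip(blocks, conds)):
        if c:
            continue  # second dispatch: if (cond) return;
        out[i] = [path_b(p) for p in block]
    return out
```

Both versions produce the same output. The possible gain on a GPU comes from each specialized program needing only the registers of its own path, which this sketch does not model.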
13:04mripard: dliviu: why did it end up in drm-misc-next-fixes then
13:59dliviu: mripard: see my exchange with mlankhorst on Thursday. I've mistakenly assumed that putting things into -fixes is the thing to do for patches that do cleanups, I should have put it into misc-next
14:38sima: airlied, should we just ack https://lore.kernel.org/dri-devel/20240613191650.9913-5-alexey.makhalov@broadcom.com/ ? v11 with no ack from vmwgfx seems a bit excessive ...
14:42javierm: zackr ^
14:42sima: hu I thought I checked but didn't see zackr, I guess I typoed :-/
14:42sima: javierm, thx
14:42javierm: sima: you are welcome
14:48sima: thellstrom, do you need someone to help review the ttm shrinker stuff?
14:48sima: not that my take will be enough by far, but maybe it helps ...
15:08zackr: sima: i thought i already rb that patch twice. did boris make him take down rb's for one of the versions? iirc, that series is blocking a work i actually wanted alexey to get done so i'm happy to see it go in
15:09sima: zackr, hm no idea, maybe complain why the r-b is getting dropped each version?
16:28childrenatwar: so all together the zero bit is carried amongst the fields, for an example 65+64 is 129 for first bit, 66+65 is second bit , so 64 and 65 are the states that are not toggled, so minuend or addend or adder and subtrahand needs -1 shifted encoder to annotate bit states that are not toggled in, and that way everything is going to work out, if subtrahand is 66-1 for third bit, minuend is 0
16:28childrenatwar: or vice versa.
16:30childrenatwar: for untoggled bits, so toggled bit state is 66 for both
16:33childrenatwar: toggled bit state for second bit is in both cases 66 so untoggled is 0 and 65 or 65 and 0
16:47sima: airlied, btw just checked, the kms_lease test already checks for double-leasing, so running that should cover all relevant edge cases
16:47sima: aside from the silly one of having an enormous pile of duplicated ids :-)
19:12rodrigovivi: mripard: gentle ping on this one: https://lore.kernel.org/dri-devel/ZmyfoMYRfKJv16KD@intel.com/ ack on this patch to go through drm-intel-next where we introduced the 'target_rr_divider' anyway? so we fix the build-doc issue...
19:25airlied: haasn: I think intel media-driver has some filmgrain shaders, but not sure if they have source, or if they use special intel media shader features
19:37zmike: mareko: since nobody else will, maybe you want to review https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6691