00:04 mareko: alyssa: no
00:04 alyssa: uh oh
00:04 mareko: alyssa: there is a way to make it perspective-correct using SYSTEM_VALUE_BARYCENTRIC_PULL_MODEL
00:05 alyssa: but radv is ok... not doing that somehow?
00:05 alyssa: (I'm trying to move it to common code, it passes the tests on Intel but..)
00:07 mareko: both AMD drivers do it incorrectly by interpolating linearly within a quad
00:08 alyssa: and that's.. ok?
00:08 mareko: no
00:08 alyssa: :-/
00:08 alyssa: i so want to say "if it's good enough for radv it's good enough for anv" :p
00:09 mareko: if W doesn't vary much between pixels, the error can be small
00:10 mareko: CTS needs coverage
00:11 alyssa: alright :/
00:19 mareko: the test I would use is this: draw a quad into 16x16, pos.w must vary a lot, then draw the same into 2x2 and interpolate in the middle of that, then compare that result with the middle pixel of 16x16
00:21 alyssa: well.. guess I'll just leave https://rosenzweig.io/0001-WIP-brw-use-common-barycentric-code.patch here then.. *sigh*
00:24 mareko: it's better to do it incorrectly in the common code than doing it incorrectly in each driver separately
00:25 mareko: however, ddx/ddy should be at the beginning of the shader because they need helper invocations
00:25 alyssa: right, yeah
00:25 robclark: lumag: btw for infra issues, you might want to ask on #freedesktop
00:25 alyssa: radv seems to have lower_interp_center_smooth for correct with _model
00:26 alyssa: I don't really understand the code, though
00:27 mareko: that might actually be the correct code
00:28 robclark: karolherbst: where is it decided that BuiltInLocalInvocationId and friends are 64b for cl? Is that in llvm? I'm wondering if we can early on realize that the hw size is only 32b and turn them into something sane. Otherwise int64 lowering turns `half y = (half)get_local_id(0);` into something funny/sad..
00:28 alyssa: mareko: inclined to believe it, but I'm still trying to understand load_barycentric_model
00:28 mareko: so I was wrong, RADV does it correctly it seems
00:29 karolherbst: robclark: it's size_t in CLC
00:29 alyssa: comparing https://docs.vulkan.org/refpages/latest/refpages/source/BaryCoordPullModelAMD.html to https://docs.vulkan.org/refpages/latest/refpages/source/BaryCoordNoPerspAMD.html
00:29 alyssa: can I derive _model by just doing frcp on the nopersp centre one (with the appropriate 1-x-y extension to vec3)?
00:30 karolherbst: robclark: but system val lowering should be able to drop stuff to 32 bits
00:30 robclark: hmm
00:30 karolherbst: like local_id should just be u2u64(local_id)
00:30 robclark: https://www.irccloud.com/pastebin/l3W94fUV/
00:31 robclark: the question is why the u2u64 doesn't get optimized out
00:31 karolherbst: robclark: I'd argue you can nuke the u2u64 there via an opt
00:31 karolherbst: int64 not well optimized?
00:31 mareko: alyssa: persp-correct interpolation interpolates x, y, and w separately and then does x/w, y/w; linear interpolation does x/w, y/w first and then interpolates the results
00:31 alyssa: robclark: opt_algebraic is a missing a ton of 16/64-bit stuff
00:31 robclark: maybe a missing algebraic opt..
00:31 robclark: yeah
00:31 karolherbst: ohh I also have a fun one
00:32 alyssa: mareko: right. thanks
00:32 karolherbst: i2i64(ushr(a, 5lsb something)) -> u2u64(ushr(....)) I have to upstream that one
00:33 mareko: alyssa: I think barycentrics are already divided by W, so interpolating those would result in linear interpolation, which I think is why pull_model gives us i/w, j/w, 1/w, which is the original x, y, w (I think), so we interpolate those, and then compute x/w, y/w to get persp-correct i, j
00:39 mareko: pull_model can be used for both linear and perspective interpolation - linear interp would divide by W first and then interpolate, while persp interp would first interpolate and then divide by W
01:02 lumag: robclark: good idea
01:20 Company: any Intel person here? I'm running into a problem with everyone's favorite format VK_FORMAT_G10X6_B10X6R10X6_2PLANE_444_UNORM_3PACK16 aka P010, where the SAMPLED_BIT is set, but neither COSITED_CHROMA nor MIDPOINT_CHROMA are set as supported
01:20 Company: I've read the code, and that's because the format is marked as VIDEO but not as YCBCR
01:21 Company: and I'm not sure if that's an oversight and it should be marked as YCBCR, or if it should not set the SAMPLED_BIT
01:21 Company: because if a multiplane format can be sampled, it must set either of the chroma flags
01:32 * Company decides it's probably because of the R10X6 which would require extra support for sampling, and prepares a patch that marks the format as non-samplable
01:37 Company: or it's because the Vulkan spec doesn't explicitly forbid sampling without ycbcr, only for VK_IMAGE_ASPECT_COLOR_BIT but not for VK_IMAGE_ASPECT_PLANE_0_BIT etc
09:15 dolphin: sima, airlied: drm-intel-next-fixes PR will be slightly delayed, there's build error on xe side at the moment
09:15 sima: ack
09:16 dolphin: hopefully still good to send tomorrow?
09:26 airlied: during merge window, there isn't really a time, I probably won't do -next-fixes until end of next week unless something urgent
09:39 dolphin: airlied: ack, was just thinking maybe you have a routine ;)
18:00 cheako: The intel guys have been quiet lately, who's responsible currently? https://cdn.discordapp.com/attachments/821523025966792754/1469839749924655246/image.png?ex=698dbbb0&is=698c6a30&hm=26893e9920bb8835f0ccde2f238e2547c0afdfd16b5c8504e76f2dd847c7141a I have a gfx-reconstruct of this, it's 250M. There should be a platoon of soldiers standing around, like the finger should be pointing at one.
18:03 sghuge: cheako: can you please file a gitlab issue and also try to attach the trace? Thank you.
18:08 cheako: As I do get the "how else are we to copy 240MB around" problem.. My experience has been that issue trackers are just a great place for unsightly things to be forgotten.
18:08 cheako: I love it when ppl claim, that they just can't think of a way to get any computer, any computer in the world, to copy the behavior... you hear me yelling, "I have one!"
18:18 sghuge: cheako: ohh maybe don't upload the trace and leave that to the person who will be debugging the issue! also, I am getting "content is no longer available" when I access the link.
18:18 sghuge: can I know which game is that? I can try to repro locally on my LNL system.
18:20 cheako: It's final fantasy tactics(2025)
18:21 cheako: Just when you view party, there should be a formation of soldiers.
18:21 sghuge: ACK! getting the game.
18:24 sghuge: cheako: the ivalice chronicles right?
18:30 cheako: yes
18:36 jimc: is there a gitlab drm-CI instance somewhere I can push to ?
20:44 valentine: jimc: what are you looking to do? There's a DRM CI instance on freedesktop that the drm/msm folks use: e.g. https://gitlab.freedesktop.org/drm/msm/-/merge_requests/218
22:35 robclark: karolherbst: btw, I'm bumping in to rusticl not being able to pipeline kernel submits, ie. looks like same thread does pctx->flush(&out_fence) followed by reading query result w/ wait=true. I guess this is happening from same thread that dispatches new grids? After some rough opt-algebraic hacks (to get past the int64 thing), I think this is the last thing that blob cl is faster than us at (since it is able to queue
22:35 robclark: submitting new work to keep the gpu loaded, whereas rusticl lets the gpu go idle)
22:35 robclark: I remember talking to you about this.. maybe 6mo back?
22:35 robclark: I thought you did some opts in the mean time, but maybe not?
22:36 karolherbst: robclark: mhh.. I still have this annoying barrier in the middle...
22:36 robclark: (I'm half way just looking for a hack at this point.. just trying to scope out all the different things that need improvement on the performance front)
22:36 karolherbst: `memory_barrier(PIPE_BARRIER_GLOBAL_BUFFER)`
22:36 karolherbst: but not sure if drivers do anything with it anyway
22:36 robclark: hmm
22:37 karolherbst: and not sure we even need it at all
22:37 robclark: I think this is coming in the get_query_result() path
22:37 karolherbst: maybe for cross context stuff
22:37 karolherbst: ahh...
22:37 karolherbst: yeah that's a blocking wait
22:37 robclark: memory barrier, we should be able to do on the gpu
22:37 karolherbst: sooo..
22:37 karolherbst: profiling in CL is a bit annoying
22:37 robclark: (so maybe not best for perf but at least it is keeping gpu spun up)
22:37 robclark: right
22:38 karolherbst: you enable profiling on the queue and atm rusticl does a blocking query for it
22:38 karolherbst: I have something for it...
22:38 robclark: hmm, I'm just looking at something simple (clpeak).. IIRC it does enable profiling?
22:38 robclark: (so I guess fixing this is moar clpeak for everyone?)
22:38 karolherbst: there is no other way to get the query thing
22:39 karolherbst: but this is fixable
22:39 karolherbst: just...
22:39 karolherbst: not easily
22:39 robclark: I guess I could just hack my get_query_result() to lie instead of block..
22:39 karolherbst: well
22:40 karolherbst: or rusticl could use get_query_result_resource
22:40 karolherbst: I even have patches for it
22:40 karolherbst: sadly it doesn't work with every driver
22:40 karolherbst: like in principle it works, but I still have to sync the query result with the host timer
22:40 karolherbst: and some drivers don't have synchronized timestamps across different APIs
22:41 karolherbst: like I have to record when an event got enqueued, I can hardly do that on the GPU
22:41 robclark: well, I have the whole ticks vs ns issue.. but iirc we decided there was a workable hack
22:41 karolherbst: right.. I'm more concerned about iris here
22:41 karolherbst: because it has a 16.8 factor or something
22:42 karolherbst: but yeah.. we could maybe define a gallium interface how to translate between get_query_result_resource and get_query_result
22:42 karolherbst: or rather.. pipe_screen::get_timestamp
22:43 robclark: yeah, if I hack get_query_result() to not block, then I'm saturating the gpu... maybe too much if I run with a high # of iterations :-P
22:44 karolherbst: https://gitlab.freedesktop.org/karolherbst/mesa/-/commits/rusticl/profiling/good?ref_type=heads
22:45 karolherbst: I just hope no driver is doing weird things if the get_query_result_resource resource is being suballocated...
22:46 robclark: would it be feasible to somehow let 1 or 2 enqueued kernels jump ahead of reading the query result, ie. so you are blocking on a slightly older query?
22:46 robclark: I'll have a look at that
22:46 karolherbst: mhhhhh not really
22:46 karolherbst: or maybe...
22:47 karolherbst: the issue is, that it's all threaded
22:47 karolherbst: and it's a context function
22:47 karolherbst: and I don't think you can read out a query on a different context
22:48 robclark: do a non-blocking result read, if that fails push the job back on the queue?
22:48 karolherbst: but the application could try to fetch the result and I can't just say "check back later"
22:48 robclark: fwiw, it appears blob does _something_ like that, I see it doing non-blocking fence waits
22:49 robclark: sure, you could block until the result is avail if app is asking for it?
22:49 robclark: I assume?
22:50 karolherbst: maybe? As I said, it's all threaded and the context isn't available from the app's calling thread. So I might have to add more communication between a queue's thread and the application's thread
22:50 karolherbst: but then I think it would be better to just use get_query_result_resource anyway...
22:50 robclark: yeah, maybe
22:51 karolherbst: with get_query_result_resource the expectation is that the GPU writes it as part of the command stream and I just map the resource and be done
22:51 robclark: I guess that moves blocking waiting for the result resource into a different thread?
22:51 karolherbst: I could do a non blocking map
22:52 robclark: sure.. but you (I assume) don't want to read the result before gpu has written it?
22:52 karolherbst: I have a different thread that signals cl_event objects when the gpu is done with it, so I do know if I can read it out
22:52 robclark: ahh, ok
22:52 karolherbst: or rather, I know if the event is in the CL_COMPLETE state, which is the only state in which that information must be available
22:53 robclark: ok.. I guess I can take a look at rebasing your branch and add pscreen->convert_timestamp() sorta thing
22:54 robclark: do you know if any drivers that can do rusticl doesn't support qbo?
22:54 karolherbst: there is a bit of a discussion on that matter, because apparently even if you know the event is in CL_COMPLETE it's not even a sync point, so CL doesn't even guarantee that memory writes of that event are done, or rather it's not as clearly specified 🙃
22:54 karolherbst: mhhh
22:54 karolherbst: I think zink was a bit of misery there
22:54 karolherbst: but I think that's fixable
22:54 robclark: doesn't zink do gl46?
22:55 karolherbst: it does, but it doesn't mean it wouldn't block in get_query_result_resource
22:55 robclark: (that said, my attempt to rusticl on zink wasn't highly successful)
22:55 karolherbst: which I think was the problem, that zink blocks in get_query_result_resource
22:55 robclark: oh...
22:56 karolherbst: worst case.. I launch a kernel doing shader clocks...
22:57 karolherbst: also what's the problem with zink on turnip?
22:58 robclark: I didn't really debug
22:59 robclark: also, I don't think I'd want kernel w/ shader clocks being the default... I'm trying to optimize here..
22:59 karolherbst: good thing that profiling is opt-in and if your app cares about perf it shouldn't enable it 🙃
23:00 robclark: cough, cough, clpeak..
23:00 robclark: I'm not sure how common this is for "real" apps.. but tflite does seem to do some things with profiling enabled..
23:01 karolherbst: yeah... my favorite solution would be to use get_query_result_resource, because that's probably as close as you can get to real numbers
23:01 karolherbst: not sure how it's across all vendors, but on nvidia it's like a command stream thing and you just write a timestamp at a GPU VA and are done
23:01 karolherbst: can't possibly have a lower overhead than that
23:02 karolherbst: and it's also probably the closest you can get to what's going on the gpu
23:03 robclark: it is kinda like that for us.. there is a CP_EVENT_WRITE event that trickles thru the pipeline of all the gpu stages and when it gets to the end it writes a timestamp back to memory
23:03 robclark: (because a gpu is like schrodinger's cat)
23:03 karolherbst: yeah
23:04 karolherbst: that should get way more reliable numbers than like doing something on the CPU which might even involve an ioctl or something...
23:04 robclark: right
23:05 karolherbst: though I guess get_query_result is the same as get_query_result_resource just that the synchronization happens at a different time and isn't necessarily stalling the pipeline
23:05 karolherbst: I could probably also just spin on get_query_result ....
23:06 karolherbst: but anyway, it's all annoying because it makes the CPU do things it shouldn't do (on that thread).
23:06 robclark: as long as you let other kernels thru, that'd be fine.. spinning on the query result has the "benefit" of ramping up cpufreq :-P
23:07 karolherbst: heh
23:07 HdkR: We doing spins in here?!
23:07 HdkR: WFE please :)
23:12 robclark: HdkR: hmm, I guess WFE won't do you much good until you get an irq from the gpu?
23:13 HdkR: robclark: Depends on what you're spinning on. If it's just spinning on a value in a cacheline somewhere then WFE always wins.
23:13 HdkR: Even if that cacheline is only updated once and IRQ fires and the kernel updates that value or whatever.
23:13 HdkR: s/and IRQ/an IRQ
23:14 HdkR: You'll save power in that spin-period
23:15 robclark: but.. my spacebar heater!
23:16 HdkR: I work on a product that is a spaceheater for your face, I'll take any power savings.
23:16 robclark: :-P
23:20 HdkR: karolherbst: If you implement it as a 32-bit or 64-bit spin on a value, let me know and I'll implement the x86/ARM power-savings version of the spin for you