00:50 karolherbst: mhh now I remember why I haven't implemented it with persistent mappings.. I need to get my head around how to fit this into rusts ownership model.. but I have an idea actually..
04:44 karolherbst: robclark: pushed a new tree to "rusticl/profiling/good" which uses a persistent + coherent mapping. But on iris I sometimes get 0s as results and no idea why 🙃 wondering if things are better on freedreno. At least it does work with zink here...
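(For context, the persistent + coherent mapping being discussed is roughly the following at the gallium level; a minimal sketch, with `event_buf` as an illustrative name rather than the actual rusticl code.)

```c
#include "pipe/p_context.h"
#include "pipe/p_defines.h"
#include "util/u_inlines.h"

/* Map the per-queue result buffer once, up front, so timestamps can later be
 * read on the CPU without re-mapping for every event. Assumes event_buf was
 * created with PIPE_BIND_QUERY_BUFFER. */
static uint64_t *
map_event_buffer(struct pipe_context *ctx, struct pipe_resource *event_buf,
                 struct pipe_transfer **xfer)
{
   return pipe_buffer_map(ctx, event_buf,
                          PIPE_MAP_READ | PIPE_MAP_PERSISTENT |
                          PIPE_MAP_COHERENT,
                          xfer);
}
```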
14:23 robclark: karolherbst: hmm, the first version had some problems with cts test_profiling when I left a cts run going overnight... should probably figure out what is going on there before worrying about making it faster
14:33 mareko: rusticl could also target vulkan
14:37 robclark: but then rusticl would have to live with the limits of vulkan..
14:37 mareko: or limits of gallium
14:37 robclark: limits of gallium are things we can change if they matter enough
14:38 mareko: cl_khr_command_buffer is where VK would be needed
15:26 karolherbst: yeah, I was considering targeting vulkan directly from time to time, shouldn't be too hard to do, because the most difficult bits are handling shaders. Not sure how feasible it would be to make nir to spirv a bit more general
15:33 karolherbst: robclark: did you have a fail in execute_multipass?
15:36 robclark: karolherbst: yes, ERROR: clGetEventProfilingInfo failed! (CL_PROFILING_INFO_NOT_AVAILABLE from /home/robclark/src/opencl/cts/OpenCL-CTS/test_conformance/profiling/execute_multipass.cpp:183)
15:36 karolherbst: yeah.. that probably means it read "0" as the result
15:36 karolherbst: I suspect something about synchronization or cache flushing is going on
15:36 robclark: unless you have reason to believe your new branch fixed things, I'll look closer
15:37 karolherbst: if anything, I suspect my branch to make that issue more visible
15:37 robclark: need to double check, but the mapping will either be writecombine or cached-coherent.. but doesn't rule out flushing on gpu side
15:38 robclark: haven't looked closer yet, still going thru other things and absorbing caffeine this morning ;-)
16:15 pac85: https://media.discordapp.net/attachments/1139200387501531178/1472264823109718108/cornellgears.png?ex=6991f0f7&is=69909f77&hm=44a40886ddb1fdea07add1988567986aabaa2f3a76207389d12a51a6cb1423bc&=&format=webp&quality=lossless&width=805&height=805
17:28 cheako: sghuge, all: Does anyone have tips for dx12 games? I just learned about vkd3d. ^^^ In reference to ivalice chronicles.
18:15 cheako: How come I have so many versions of d3d12? https://pastebin.com/C0Eb8ryw
18:18 zf: all dlls in steam get installed (well, symlinked actually) twice per prefix
18:18 zf: once for each of 32-bit and 64-bit
18:24 zf: twice per game, I mean
18:27 robclark: karolherbst: yeah, looks like profiling fails are because res=0.. idk where waiting for result to be ready is, but that appears to be missing
18:27 karolherbst: robclark: I'm kinda inclined to think this is more of a cross context problem, but not quite sure yet
18:28 robclark: I wouldn't rule out some issue on driver side.. but curious if you can repro fails
18:28 karolherbst: I see similar behavior on iris
18:28 karolherbst: but zink seems fine
18:29 karolherbst: but zink also reads on the CPU, soo...
18:29 karolherbst: lp is fine, which isn't surprising either
18:30 robclark: I *guess* there is a missing fence wait, which a CPU-side wait (in either the old or new path) was papering over
18:30 karolherbst: compared to main my branch has like ~8% lower overhead in clpeak
18:30 karolherbst: on lp that is
18:30 robclark: not super concerned about overhead, as much as just letting the gpu and cpu work in parallel in the first place ;-)
18:31 robclark: at least at this point
18:31 karolherbst: yeah sure
18:31 karolherbst: mhhh
18:31 karolherbst: the thing is..
18:31 karolherbst: there is a thread that waits on fences that are created after the event was processed. And only once that fence wait has completed will the event be put into the CL_COMPLETE state
18:32 karolherbst: and only then will the call succeed in reading a timestamp..
18:32 karolherbst: but maybe the issue is somewhere else..
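(The ordering described above, as a minimal gallium-level sketch; `results` and `slot` refer to the illustrative persistent mapping from the earlier sketch, not actual rusticl names.)

```c
/* Worker-thread side: flush the work (including the GPU write of the
 * timestamp into the event buffer), wait for the fence on the CPU, and only
 * then mark the event CL_COMPLETE, so clGetEventProfilingInfo() reads a value
 * the GPU has already written. */
struct pipe_fence_handle *fence = NULL;
ctx->flush(ctx, &fence, 0);
screen->fence_finish(screen, NULL, fence, PIPE_TIMEOUT_INFINITE);
/* event -> CL_COMPLETE happens here */
uint64_t ts = results[slot];   /* read through the persistent mapping */
```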
18:32 robclark: hmm, status does seem to be CL_COMPLETE..
18:33 karolherbst: yeah... it's a bit weird
18:33 karolherbst: maybe the coherent mapping is not as coherent as I think it is
18:33 karolherbst: but you mentioned you see the same issue with my previous version, that wasn't doing coherent mappings yet
18:33 karolherbst: and was only mapping when reading it out
18:34 robclark: right.. haven't tried new branch yet
18:34 robclark: kinda juggling two different debugs on two branches atm
18:34 karolherbst: but I was mapping on a different context than the get_query_result_resource was executed on
18:35 karolherbst: so that doesn't change and is probably the reason for the issue
18:35 karolherbst: do I need to barrier? but a coherent mapping _should_ fix that
18:35 karolherbst: maybe
18:35 karolherbst: but the barrier also wouldn't be needed with the old version
18:36 karolherbst: let me write a gallium test app for this...
18:36 robclark: Hmm, in my case I think mapping on a different ctx should be fine.. it's all just memory.. not like this is an image that needs untiling or anything
18:37 karolherbst: yeah.. it's weird
18:37 karolherbst: like on iris it's not a reliable fail, it just sometimes fails
18:37 robclark: I wouldn't rule out the CPU seeing the fence write before the query hits memory.. but I _think_ that should be ok
18:37 robclark: yeah, well, timing ;-)
18:37 robclark: not 100% fail here either
18:39 karolherbst: well execute_multipass seems pretty reliable tho
18:39 karolherbst: at least with newest version
18:40 karolherbst: anyway. let me write a simple demo app against gallium, should be way simpler to reason about it there
18:40 robclark: sg
19:12 karolherbst: robclark: uhm.. would it be a problem if I don't set PIPE_BIND_QUERY_BUFFER ...
19:13 karolherbst: not sure if drivers care all that much about that one..
19:13 karolherbst: just one thing that I've noticed
19:13 karolherbst: ohh nvm, I actually set that one..
19:20 robclark: there is a special hack for older gens (like a2xx/a3xx IIRC) that needs PIPE_BIND_QUERY_BUFFER.. but those can't do the _result variant of queries.. or rusticl in general
19:21 robclark: err, _resource()
19:33 karolherbst: robclark: https://gitlab.freedesktop.org/karolherbst/mesa/-/commit/c48b81584291ebeb0d05ac176a48be25c8e4e082
19:33 karolherbst: sooo
19:33 karolherbst: that doesn't work on iris...
19:33 karolherbst: and I'm not really seeing why not :)
19:33 karolherbst: get_query_result_resource is 0 and get_query_result has a proper value
19:34 karolherbst: it only makes sense if the fence doesn't impact queries
19:34 karolherbst: and if queries need a different sync primitive
19:36 karolherbst: works fine on lp tho
19:37 karolherbst: do I have to use PIPE_QUERY_WAIT with get_query_result_resource? mhh
19:39 robclark: I think w/ qbo barriers are needed.. but in theory you shouldn't need anything if you are reading on the cpu after the fence is signaled (at least on freedreno).. hmm, but I guess I should double check that qbo tests are still passing on 8xx
19:41 karolherbst: but also.. it's a timestamp, what's there to wait on on the GPU side...
19:41 karolherbst: but using PIPE_QUERY_WAIT does make it work with iris...
19:41 karolherbst: and afaik with get_query_result_resource it's not about a CPU side wait
19:45 robclark: right, usually qbo reads are from gpu but possibly a different stage
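(The sequence the linked test exercises, roughly; a trimmed sketch rather than the commit itself, with `result_buf` as an illustrative name.)

```c
/* Timestamp query written to a buffer on the GPU, then read on the CPU after
 * a fence wait. Without PIPE_QUERY_WAIT this is the path that reads back 0 on
 * iris in the scenario above. */
struct pipe_query *q = ctx->create_query(ctx, PIPE_QUERY_TIMESTAMP, 0);
ctx->end_query(ctx, q);
ctx->get_query_result_resource(ctx, q, 0 /* or PIPE_QUERY_WAIT */,
                               PIPE_QUERY_TYPE_U64, 0 /* index */,
                               result_buf, 0 /* offset */);

struct pipe_fence_handle *fence = NULL;
ctx->flush(ctx, &fence, 0);
screen->fence_finish(screen, NULL, fence, PIPE_TIMEOUT_INFINITE);

/* CPU-side reference value for comparison */
union pipe_query_result ref;
ctx->get_query_result(ctx, q, true /* wait */, &ref);
```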
19:48 cheako: sghuge: I found a version of vkd3d that fixed fft.
20:01 robclark: karolherbst: hmm, PIPE_QUERY_WAIT forces cmdstream flush (submit ioctl).. which probably isn't what we want (but probably papers over issues)
20:01 karolherbst: mhhh
20:01 karolherbst: but I do flush in that test anyway...
20:05 karolherbst: anyway, we should probably decide if the code there should give me a query result or not as it is right now
20:10 robclark: I think I see a possible issue on my side.. we are probably copying from internal buffer to result rsc before the write lands to the internal buffer (at least for the end query)
20:12 robclark: (oh, and get_query_result_resource() not implemented for a5xx... which I suppose could plausibly sorta do rusticl.. I guess if someone cared enough they could copy some code from a6xx, since it should work the same way)
20:14 robclark: yeah, ok.. that was the issue
20:16 robclark: yeah, fixes all the fails on a6xx and a8xx
20:17 robclark: karolherbst: btw we should probably actually introduce PIPE_QUERY_TIMESTAMP_RAW documented to require pscreen->convert_timestamp()
20:17 karolherbst: yeah...
20:18 karolherbst: I was thinking the same, because intel already does some conversion, which isn't accurate tho
20:23 karolherbst: anyway, it doesn't explain why it's broken with iris, tho could also be a driver bug
20:24 karolherbst: mhhhh
20:26 karolherbst: iris predicates the result?
20:27 karolherbst: mhhh
20:29 karolherbst: robclark: I have a theory: with GL you never see a timestamp query with get_query_result_resource and so it's likely broken everywhere?
20:31 robclark: I think gl cts doesn't have very good coverage of it.. and in real life it is probably used mostly just for occlusion query
20:31 robclark: there might be some piglit tests I fail since not converting from ticks to ns
20:32 robclark: (some day I might get a pm4 pkt to do the conversion.. but that doesn't exist today)
20:32 karolherbst: yeah.. or that
20:33 karolherbst: robclark: I'm not a _big_ fan of adding another query type given how drivers have their code paths for all the queries, maybe we should just add `PIPE_QUERY_RAW_VALUE` to pipe_query_flags
20:34 karolherbst: tho also that can have messy semantics with other queries...
20:35 robclark: I was kinda thinking rusticl tries PIPE_QUERY_TIMESTAMP_RAW and then falls back to PIPE_QUERY_TIMESTAMP so we don't have to implement the new query everywhere
20:36 robclark: then we just add the new query for the drivers that benefit
20:36 karolherbst: mhh that might work...
20:37 robclark: it would make sense to be part of the patch that introduces convert_timestamp().. I'll type that up later
20:37 karolherbst: okay, cool
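(Hypothetical sketch of the fallback being proposed: PIPE_QUERY_TIMESTAMP_RAW and pscreen->convert_timestamp() do not exist in gallium yet, and detecting support via create_query() returning NULL is purely an assumption for illustration.)

```c
/* Try the proposed raw query type first, fall back to the existing converted
 * one so not every driver has to implement the new type. */
struct pipe_query *q = ctx->create_query(ctx, PIPE_QUERY_TIMESTAMP_RAW, 0);
bool raw = (q != NULL);
if (!raw)
   q = ctx->create_query(ctx, PIPE_QUERY_TIMESTAMP, 0);

/* ... end_query / get_query_result_resource / readback as before ... */
uint64_t ticks = results[slot];   /* raw ticks, or ns if already converted */

/* Drivers ticking in something other than ns would convert on request,
 * conceptually ns = ticks * NSEC_PER_SEC / tick_frequency. */
uint64_t ns = raw ? screen->convert_timestamp(screen, ticks) : ticks;
```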
21:34 karolherbst: yeah.. looks like with iris 25% of the CPU time is spent on creating the resource for the events mhh..
21:35 karolherbst: 90% of the ndrangekernel call... oops
21:36 karolherbst: util_vma_heap_alloc apparently is quite expensive
21:36 karolherbst: I have an idea...
21:38 karolherbst: I could just create 0x1000 sized buffers and run through them from start to end and then just throw them away and allocate new ones. No need for anything more complex
21:38 karolherbst: per queue
21:46 karolherbst: or I use the stream uploader...
21:47 karolherbst: ehh probs not
21:51 robclark: yeah, suballoc from a page size buffer if you can
21:52 robclark: that will lower cpu overhead for driver
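(A rough sketch of that suballocation scheme, in illustrative C rather than actual rusticl code; names are made up, and it assumes each event holds its own reference on the buffer its slot lives in.)

```c
#include "pipe/p_defines.h"
#include "util/u_inlines.h"

/* Per-queue bump allocator: hand out 8-byte timestamp slots from a page-sized
 * buffer and only allocate a fresh buffer once the current one is used up. */
struct slot_alloc {
   struct pipe_resource *buf;   /* 0x1000-byte PIPE_BIND_QUERY_BUFFER buffer */
   unsigned next;               /* byte offset of the next free slot */
};

static unsigned
alloc_slot(struct pipe_screen *screen, struct slot_alloc *a,
           struct pipe_resource **buf_out)
{
   if (!a->buf || a->next + sizeof(uint64_t) > 0x1000) {
      /* drop our reference; events still using the old buffer keep their own */
      pipe_resource_reference(&a->buf, NULL);
      a->buf = pipe_buffer_create(screen, PIPE_BIND_QUERY_BUFFER,
                                  PIPE_USAGE_STAGING, 0x1000);
      a->next = 0;
   }
   unsigned offset = a->next;
   a->next += sizeof(uint64_t);
   *buf_out = a->buf;
   return offset;
}
```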