00:12 karolherbst: I don't want to spin at all
00:15 HdkR: The offer still stands for when the spins inevitably happen :D
04:52 cheako: There is a lot more that isn't drawn in the game as a whole and renderdoc says vulkan is just handed the rendered data... makes sense if it's a PSP source port, there probably is some CPU render pipeline that mimics the 2004 handheld.
05:34 cheako: The combat summary screen is missing enough to make the game unenjoyable. There were a lot of Intel users reporting "aura" glitches, but I haven't seen any.
13:45 cheako: I'm the 609th person to open a ticket with square.
13:45 cheako: They had an AI replying to me.
14:20 robclark: karolherbst: btw re cl_khr_subgroups, I think CL_DEVICE_SUB_GROUP_INDEPENDENT_FORWARD_PROGRESS is optional? At least a6xx blob cl advertises cl_khr_subgroups but not CL_DEVICE_SUB_GROUP_INDEPENDENT_FORWARD_PROGRESS
14:25 robclark: hmm, seems like there are conflicting statements
14:54 robclark: karolherbst: looks like cl cts only requires that it returns CL_FALSE if subgroups are not supported.. so I think we can expose cl_khr_subgroups as-is
14:54 robclark: (I mean, if the gallium driver supports subgroups)
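A minimal sketch (not from the log) of the device query being discussed; it assumes an existing cl_device_id `dev`, and per the CTS behaviour robclark describes, getting CL_FALSE back is acceptable even when cl_khr_subgroups is advertised:

```c
/* Minimal sketch: query the independent-forward-progress flag. */
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <stdio.h>

static void check_subgroup_ifp(cl_device_id dev)
{
   cl_bool ifp = CL_FALSE;
   cl_int err = clGetDeviceInfo(dev,
                                CL_DEVICE_SUB_GROUP_INDEPENDENT_FORWARD_PROGRESS,
                                sizeof(ifp), &ifp, NULL);
   if (err == CL_SUCCESS)
      printf("independent forward progress: %s\n", ifp ? "yes" : "no");
}
```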
15:01 cwabbott: robclark: the CL/VK subgroups and GL subgroups are a bit different
15:02 cwabbott: CL/VK uses int128 and has the special ballot operations which are specialized by the driver based on the max ballot size
15:02 cwabbott: the legacy GL extension uses int64 and assumes that subgroups are at most 64 invocations
15:03 robclark: sure, but that isn't about IFP
15:03 cwabbott: it means that freedreno can't expose subgroups
15:03 cwabbott: since it can't expose the legacy GL subgroup extension
15:04 cwabbott: at least not on a6xx-a7xx, since wave128 is a thing and it doesn't understand how to limit to 64 when subgroups are in use
15:05 robclark: Hmm, I care a bit more about this in the context of CL..
15:05 cwabbott: yeah, but it's one gallium property iiuc
15:05 cwabbott: there probably has to be some plumbing in gallium/NIR to be able to tell them apart
15:05 robclark: but I think we could restrict to wave64 if it is not a compute-only context and subgroups are used
15:05 cwabbott: or that, yeah
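A hypothetical sketch of the wave-size heuristic being discussed, not actual freedreno/turnip code; the function and parameter names are illustrative only:

```c
/* Hypothetical heuristic: only clamp to wave64 when the legacy 64-bit GL
 * ballot semantics could actually be observed. */
#include <stdbool.h>

static unsigned choose_wave_size(bool compute_only_context,
                                 bool shader_uses_subgroups,
                                 unsigned max_wave_size /* 128 on a6xx/a7xx */)
{
   /* the legacy GL subgroup/ballot extension assumes <= 64 invocations */
   if (!compute_only_context && shader_uses_subgroups)
      return 64;
   /* the CL/VK path specializes ballot ops for the real subgroup size */
   return max_wave_size;
}
```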
15:07 cwabbott: I don't remember if we ever exposed new-style subgroups in GL
15:08 cwabbott: oh no, we did
15:10 robclark: I guess zink ignores this, IIRC
15:13 cwabbott: seems like zink on turnip will return shader_subgroup_size = 128
15:14 cwabbott: oh, but it's also conditioned on shaderFloat64, which ofc we don't expose...
15:14 cwabbott: so I guess we don't expose subgroups
15:15 cwabbott: ok, but it doesn't expose shader_ballot unless subgroupSize <= 64
15:18 cwabbott: seems like there are separate caps for legacy ballot and subgroups
20:37 robclark: karolherbst: https://gitlab.freedesktop.org/robclark/mesa/-/commits/rusticl/profiling .. rebased your branch and added bit to convert_timestamp().. seems to work..
20:37 robclark: although I noticed asahi implements get_query_result_resource() on the CPU for timestamp queries.. but maybe that is not worse than the previous state?
20:37 karolherbst: cool, does it also improve throughput?
20:38 karolherbst: ohh yeah.. drivers implementing get_query_result_resource on the CPU should be just as bad as get_query_result here
20:38 robclark: seems to improve things (but also with other opts I had rebased onto that branch)
20:39 robclark: need to double check that we aren't ending up with more SUBMIT ioctls/etc.. I'll check that after lunch
20:41 karolherbst: I was considering creating a resource with a persistent + coherent mapping and slab allocating it with vm_heap (or a proper slab allocator), so I don't have to map+unmap all the time or create resources permanently
20:44 robclark: that probably doesn't matter much for iGPU.. idk about for vram. But I guess reading the query result probably isn't critical path
20:45 robclark: idk if you tried clpeak with that change, but I guess it would help iris/radeonsi
20:51 karolherbst: yeah.. the purpose of this would only be to reduce CPU overhead a little by skipping all the gallium API calls
20:52 karolherbst: and probably also RAM usage
20:53 karolherbst: though the latter is debatable, because I still have to allocate memory for the slabs...
20:58 robclark: karolherbst: I think maybe suballocate for the Event result buffer? Or are you already doing that somewhere?
20:59 karolherbst: most drivers already sub allocate small allocations
20:59 karolherbst: so I'm more concerned about the constant map+unmap dance I'm doing here and would rather have a persistent mapping + offset thing
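A rough sketch (illustrative C, not actual rusticl code, which is Rust) of the persistent-mapping-plus-offset idea: create one query-result buffer that allows persistent/coherent mapping, map it once, and hand out 64B-aligned offsets instead of doing a map+unmap per result read. `result_slab` and its helpers are made up for illustration:

```c
#include "pipe/p_context.h"
#include "pipe/p_defines.h"
#include "pipe/p_screen.h"
#include "pipe/p_state.h"
#include "util/u_inlines.h"

struct result_slab {
   struct pipe_resource *res;
   struct pipe_transfer *xfer;
   uint8_t *cpu_ptr;
   unsigned next_offset;
};

static bool result_slab_init(struct result_slab *s, struct pipe_context *ctx,
                             unsigned size)
{
   struct pipe_resource tmpl = {
      .target = PIPE_BUFFER,
      .format = PIPE_FORMAT_R8_UNORM,
      .bind = PIPE_BIND_QUERY_BUFFER,
      .usage = PIPE_USAGE_STAGING,
      .flags = PIPE_RESOURCE_FLAG_MAP_PERSISTENT |
               PIPE_RESOURCE_FLAG_MAP_COHERENT,
      .width0 = size, .height0 = 1, .depth0 = 1, .array_size = 1,
   };

   s->res = ctx->screen->resource_create(ctx->screen, &tmpl);
   if (!s->res)
      return false;

   /* one long-lived mapping instead of a map+unmap per read */
   s->cpu_ptr = pipe_buffer_map_range(ctx, s->res, 0, size,
                                      PIPE_MAP_READ | PIPE_MAP_PERSISTENT |
                                      PIPE_MAP_COHERENT, &s->xfer);
   s->next_offset = 0;
   return s->cpu_ptr != NULL;
}

/* trivial bump allocation of 64B slots; a real version would recycle them */
static unsigned result_slab_alloc(struct result_slab *s)
{
   unsigned off = s->next_offset;
   s->next_offset += 64;
   return off;
}
```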
21:05 jannau: robclark: I'm in the process of fixing asahi
21:05 robclark: I cache the mmap so that should be free.. but suballocated buffers get aligned to (IIRC) 64B
21:05 robclark: ahh, nice
21:06 robclark: jannau: https://gitlab.freedesktop.org/robclark/mesa/-/commit/887e076bd3b2cf2af5115247b519f8d6a53b8a38 gives a non-compute-shader way to do it (I'm assuming asahi has the same issue as adreno converting ticks to ns)
21:07 jannau: although I'll miss the reported throughput of 220200 GB/s or so for large buffer copies in gpu-ratemeter gl.bufbw
21:08 robclark: heh
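For context, a hedged sketch of what a convert_timestamp()-style hook boils down to; the 19.2 MHz example rate for the Adreno always-on counter is an assumption, not something taken from the linked branch:

```c
/* Scale raw GPU counter ticks to nanoseconds without overflowing 64 bits. */
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

static uint64_t ticks_to_ns(uint64_t ticks, uint64_t counter_hz /* e.g. 19200000 */)
{
   /* 128-bit intermediate so ticks * 1e9 cannot overflow */
   return (uint64_t)(((__uint128_t)ticks * NSEC_PER_SEC) / counter_hz);
}
```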
21:09 HdkR: At least newer Apple Silicon can just run its cycle counter at 1Ghz as required by ARMv9.1 :D
21:09 HdkR: cycles to ns is easy at that point
21:10 robclark: but what about whatever the gpu uses?
21:10 jannau: the kernel driver already handles timestamp conversion and in principle the mesa driver tries to do the right thing and attaches timestamp queries to outstanding render jobs
21:11 HdkR: I'm hoping it's the same counter, but that's speculating
21:11 jannau: the broken case is submitted but not yet completed jobs
21:12 jannau: GPU and AP use the same counter
21:12 jannau: as far as we know
21:12 robclark: jannau: oh, are you not reading timestamp on the GPU?
21:17 karolherbst: robclark: I wonder if the kernel launch latency also gets reduced with those patches
21:18 karolherbst: I was looking into it on iris, but then fixing the value on the GPU turned out to be more expensive, because the macro thing you can use on the command processor can't do 64 bit divisions/multiplications, and it's a lot of pain and the macro got built on every call :)
21:18 karolherbst: but I think with your callback idea that's going to be a lot smoother
21:18 jannau: in the case of already submitted jobs the driver just asks for the current GPU time without fences. if it knows beforehand, timestamps are tied to job execution and filled in by the GPU
21:18 karolherbst: on my system the query part was like 30% of the CPU cycles spent on that `--kernel-latency` clpeak test
21:19 robclark: karolherbst: it did go down, but I'd need to pin cpu freq to make sure that isn't just cpufreq
21:19 karolherbst: yeah fair
21:19 karolherbst: sadly the latency is just impressively low on ROCm and Intel NEO, because of usermode command submission :')
21:20 karolherbst: they hit like <10 us
21:21 robclark: hmm, actually looks like it got worse.. although it is low enough that we could just be measuring the time for the extra CP cmds to copy the result
21:22 karolherbst: it uses the queries result
21:22 robclark: I mean, the driver has an internal buffer
21:22 robclark: most queries (other than timestamp) store a start and end value
21:22 karolherbst: ahh..
21:23 robclark: so there is an extra thing to do on the CP for get_query_result_resource()
21:23 karolherbst: mhhhh
21:24 karolherbst: would there be a more efficient way to get it?
21:24 robclark: with gallium api changes, I guess
21:24 karolherbst: can always add a new query type
21:25 robclark: ie. we need to know the address to write to when the query is started/created
21:25 karolherbst: I see
21:25 karolherbst: so each query has its own internal buffer, and get_query_result_resource just copies from that into the buffer it's given
21:26 robclark: anyways, I wouldn't worry about that yet.. we can optimize later.. either way if you aren't pinning the freq then the latency is much higher (ie. with schedutil) unless the app is actually doing things
21:26 robclark: right
21:27 karolherbst: could maybe change create_query to allow handing in a pipe_resource + offset
21:27 karolherbst: or make it a new object
21:28 robclark: maybe.. I don't want to get too ahead
21:28 karolherbst: yeah.. should investigate first how other hardware works there
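A purely hypothetical sketch of the create_query variant floated above, not an existing gallium interface; the name `create_query_with_target` is made up:

```c
/* Hand the destination resource + offset to the driver when the query is
 * created, so the GPU can write the result there directly instead of
 * get_query_result_resource() copying it out of an internal buffer later. */
struct pipe_context;
struct pipe_query;
struct pipe_resource;

struct pipe_query *(*create_query_with_target)(struct pipe_context *ctx,
                                               unsigned query_type,
                                               unsigned index,
                                               struct pipe_resource *dst,
                                               unsigned dst_offset);
```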
21:28 robclark: it does seem like we end up w/ more GEM_SUBMIT ioctls, so maybe there is something completely different going on
21:29 robclark: at any rate, I'd say this is a pretty big improvement, since we can actually pipeline gpu and cpu now
21:29 karolherbst: yep
21:29 robclark: so I'd say put up an MR
21:29 robclark: we might find things to optimize later.. if we do, it's all one git tree and we can make changes to gallium or whatever is needed
21:31 karolherbst: yeah.. there are bigger things I'm worried about in regards to gallium API changes, e.g. cl_khr_command_buffer :')
21:32 robclark: shudder
21:33 karolherbst: yeah....
21:34 robclark: "On embedded devices where building a command stream accounts for a significant expenditure of resources.." ... setting up and running a compute shader is so simple compared to anything 3d..
21:34 karolherbst: anyway yeah, should probably be able to create a proper MR with convert_timestamp, just want to make sure it's also working properly on zink, iris, radeonsi and asahi
21:35 karolherbst: oh yeah
21:35 karolherbst: but
21:35 karolherbst: CL requires API usage validation
21:35 karolherbst: and with command buffers you can also eliminate that
21:36 robclark: write a layer.. and cl_yolo_no_error_checking?
21:36 karolherbst: there is already a layer
21:36 karolherbst: for command buffers I mean
21:37 karolherbst: but yeah.. I was considering bringing up moving validation into the ICD loader and such...
21:37 karolherbst: or make it a mesa thing first like the no error thing we had for GL
21:37 karolherbst: but even a native impl of cl_khr_command_buffer in rusticl would be able to lower CPU overhead
21:38 karolherbst: I want to think about command reordering + cl_khr_command_buffer at some point, because they kinda interact from an optimization pov
21:38 karolherbst: but also not sure how doing native command buffers will work out...
21:38 karolherbst: might just make it a zink only thing...
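For reference, a hedged sketch of the cl_khr_command_buffer flow being discussed (record once, finalize/validate once, replay cheaply); it assumes `queue`, `kernel` and the NDRange sizes already exist, omits error handling, and in practice the KHR entry points are fetched via clGetExtensionFunctionAddressForPlatform:

```c
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl_ext.h>

static cl_command_buffer_khr record_launch(cl_command_queue queue,
                                           cl_kernel kernel,
                                           const size_t *gws,
                                           const size_t *lws)
{
   cl_int err;
   cl_command_buffer_khr cb = clCreateCommandBufferKHR(1, &queue, NULL, &err);

   /* recorded and validated once ... */
   clCommandNDRangeKernelKHR(cb, queue, NULL, kernel, 1, NULL, gws, lws,
                             0, NULL, NULL, NULL);
   clFinalizeCommandBufferKHR(cb);
   return cb;
}

/* ... then replayed per iteration with much less per-enqueue work:
 *    clEnqueueCommandBufferKHR(0, NULL, cb, 0, NULL, NULL);
 */
```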
21:59 robclark: karolherbst: random thought, most things are already CSO's which are pretty easy to emit.. the thing that isn't a CSO is pipe_grid_info.. if that could be re-used then pctx->launch_grid() becomes writing a handful of pointers to pre-baked state..
23:21 karolherbst: robclark: mhh yeah, I think we can definitely improve on the compute path. The way that global buffers are "bound" also sucks
23:21 karolherbst: but.. launch_grid also allows bindless global buffers now
23:22 karolherbst: maybe we could have a launch_grid2 call that encodes the entire state in an object or as parameters
23:22 karolherbst: but not quite sure how that all works out across hw vendors
23:23 karolherbst: maybe I should allow drivers to set set_global_binding to NULL and then rusticl always uses the bindless global buffer path, which is currently only used for BDA and SVM
23:24 karolherbst: that's really where a lot of the CPU overhead comes from, at least on rusticl's side
23:26 karolherbst: another aspect is to pre-build the kernel input buffer; atm I rebuild it every launch, but I was also considering finding a way to either cache it or just update it alongside kernel args, but set_global_binding also makes that more difficult..
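A hedged sketch of the bindless-global-buffer idea above: the "binding" becomes a 64-bit GPU address written into the kernel input buffer, so a launch is just filling pipe_grid_info and calling launch_grid. Field usage is simplified and `buf_gpu_addr` is assumed to come from however the driver exposes buffer addresses:

```c
#include <stdint.h>
#include <string.h>
#include "pipe/p_context.h"
#include "pipe/p_state.h"

static void launch_bindless(struct pipe_context *ctx,
                            uint8_t *input, unsigned arg_offset,
                            uint64_t buf_gpu_addr,
                            const uint32_t block[3], const uint32_t grid[3])
{
   struct pipe_grid_info info = {0};

   /* the "binding" is nothing more than an address in the input buffer */
   memcpy(input + arg_offset, &buf_gpu_addr, sizeof(buf_gpu_addr));

   info.work_dim = 3;
   memcpy(info.block, block, sizeof(info.block));
   memcpy(info.grid, grid, sizeof(info.grid));
   info.input = input;

   ctx->launch_grid(ctx, &info);
}
```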