00:04 mclasen: I was thinking of the xeyes one that went by recently
12:01 nerdopolis: Curiosity: What would be needed in Wayland or Mesa for display servers to survive the removal of the simpledrm device when the kernel replaces it with the real card?
12:07 kennylevinsen: many display servers already support new GPUs showing up, but at least wlroots/sway doesn't support changing the device that acts as primary renderer
12:07 kennylevinsen: so it would need to realize the transition is going on, throw away its current renderer and start over with the new one
12:08 zamundaaa[m]: nerdopolis: simpledrm -> real GPU driver shouldn't need changes in Wayland
12:08 kennylevinsen: yeah it's just display server logic
12:08 zamundaaa[m]: A real GPU driver going away, so that you can't use linux dmabuf anymore, that would be more tricky
12:11 nerdopolis: Yeah, I guess the core protocol is fine, I should not have phrased it that way. (Unless clients also need to change to be aware)
12:11 kennylevinsen: well clients might want to start using the new renderer
12:12 kennylevinsen: if they had previously started with a software renderer
12:12 kennylevinsen: whether intentionally or through llvmpipe
12:20 nerdopolis: That would make sense. But for the ones that don't care, like say kdialog or some greeter would be fine?
12:23 karolherbst: kennylevinsen: mhhh, that reminds me of how macos handles those things. There is an "I support switching over to a new renderer" opt-in flag for applications, and the OS signals to applications that this is going to happen (because the GPU is going away or for other reasons), so applications get explicitly told which device to use and when.
12:24 karolherbst: otherwise they all render on the discrete one (which has its problems for other reasons)
12:37 nerdopolis: Does simpledrm even support dmabuf?
12:40 Company: switching GPUs is not really supported anywhere because it basically never happens
12:40 Company: so it's not worth spending time on
12:40 Company: same for gpu resets
12:40 DemiMarie: GPU resets absolutely happen
12:41 DemiMarie: I'm looking at AMD VAAPI here.
12:41 DemiMarie: And Vulkan video
12:41 Company: I write a Vulkan driver on AMD, I know that GPU resets do happen
12:41 Company: but I don't think I've ever had one in GTK's issue tracker for example
12:42 DemiMarie: I doubt GTK gets the bug reports
12:42 DemiMarie: Though I will happily write one for GTK not handling device loss.
12:42 Company: maybe - Mutter goes pretty crazy on GPU resets
12:42 DemiMarie: That should be fixed
12:43 nerdopolis: Switching GPUs is starting to be an issue, especially with simpledrm. It was reported against mutter first https://gitlab.gnome.org/GNOME/mutter/-/issues/2909 which was worked around with a timeout, and then sddm
12:43 DemiMarie: wlroots is working on it, as is Smithay
12:43 Company: I usually reboot when it happens
12:43 DemiMarie: KWin handles it already
12:43 DemiMarie: Company: that is horrible UX
12:44 Company: you're welcome to implement it, write tests to make sure it doesn't regress and fix all the issues with it
12:44 nerdopolis: The issue is that amdgpu takes a longer time to start up, so /dev/dri/card0 is actually simpledrm, and then the display server starts using it. Then when amdgpu (or other drivers) finishes simpledrm goes away and gets replaced with /dev/dri/card1
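[Context note: a compositor can tell whether a card node is still only the firmware framebuffer by querying the DRM driver name via libdrm; a minimal sketch, assuming simpledrm reports its driver name as "simpledrm":]

    #include <fcntl.h>
    #include <stdbool.h>
    #include <string.h>
    #include <unistd.h>
    #include <xf86drm.h>

    /* Returns true if the DRM node at `path` is backed by simpledrm
     * (i.e. only the firmware framebuffer, no real driver bound yet). */
    static bool is_simpledrm(const char *path)
    {
        int fd = open(path, O_RDWR | O_CLOEXEC);
        if (fd < 0)
            return false;

        drmVersionPtr ver = drmGetVersion(fd);
        bool simple = ver && strcmp(ver->name, "simpledrm") == 0;

        drmFreeVersion(ver);
        close(fd);
        return simple;
    }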
12:44 DemiMarie: Company: Or not use Mutter
12:45 Company: DemiMarie: yeah, if you commonly reset your gpu, that's probably not a bad idea - though I would suggest not resetting the gpu
12:45 jadahl: in mutter the intention is to handle it like a gpu reset where everything graphical just starts from scratch. but the situation that simpledrm introduces is not what that is intended for, as simpledrm showing up for a short little while and then getting replaced causes a broken bootup experience. so unless it can be handled kernel side, we might need to work around it by waiting a bit if we end up with simpledrm to
12:45 jadahl: see if anything more real shows up
12:45 DemiMarie: That said, it might be simpler for Mutter to crash intentionally when the GPU resets, so the user can log back in.
12:46 jadahl: because we don't want to start rendering with simpledrm, then switching to amdgpu
12:46 jadahl: (at bootup)
12:46 DemiMarie: Company: Do AMD GPUs support recovery from resets, or is it usually impossible on that hardware?
12:47 Company: jadahl: what do you do with all the wl_buffers you (no longer) have in Mutter? Tell every app to send a new one and wait until they sent you one?
12:47 nerdopolis: jadahl: But what about systems that don't have driver support, and only support simpledrm? Are they going to be stuck with an 8 second timeout?
12:48 jadahl: Company: we don't really handle gpu resets, so now we don't do anything. in a branch we do a trick for wlshm buffers to have them redraw, but it doesn't handle switching dmabuf main device etc
12:48 Company: DemiMarie: I'm pretty sure it can be made to work somehow, because Windows seems to be able to do it (the OS, not the apps)
12:49 jadahl: nerdopolis: that is the annoying part, they'd get a slower boot experience because the kernel doesn't have the slightest clue whether a gpu will ever show up after boot
12:49 Company: DemiMarie: also, when installing new drivers on Windows it tends to work
12:49 karolherbst: the one situation where changing the renderer makes sense is if you e.g. want to build your compositor in a way where you have multiple rendering contexts per display/GPU, so you won't have to do the render on discrete GPU -> composite on integrated GPU -> scanout on discrete GPU round trip and can stay local to one GPU. So if you move a window from one GPU to another, the compositor _could_ ask the applications to switch the renderer as well to
12:49 jadahl: (even if it's connected already etc)
12:49 karolherbst: save on e.g. PCIe bandwidth, which is a significant bottleneck with higher resolutions
12:49 Company: jadahl: I was wondering about the dmabufs
12:49 jadahl: Company: the compositor would switch main device, and the clients would need to come up with new buffers
12:49 DemiMarie: karolherbst: how significant a bottleneck?
12:50 karolherbst: depends on a looot of things
12:50 Company: jadahl: right, so you'd potentially be left without buffers for surfaces for a (likely short) while
12:50 karolherbst: I was playing around with some PCIe link things in nouveau a few years back and I saw differences of over 25%
12:50 karolherbst: in fps numbers
12:50 jadahl: karolherbst: or gpu hotplug a beefy one
12:50 karolherbst: yeah. or that
12:51 karolherbst: but games usually don't support it, so they would just stick with one
12:51 mclasen: jadahl: you could give mutter some config to turn off the wait
12:51 jadahl: Company: indeed, one would need to wait for a little bit to avoid avoidable glitches
12:51 karolherbst: but the point is, that the round trip to the integrated GPU causes bottlenecks on the PCIe link
12:51 karolherbst: (but to fix this you'd probably have to rewrite almost all compositors)
12:51 jadahl: mclasen: how would one set that automatically ?
12:51 mclasen: you won't
12:51 zamundaaa[m]: Demi: I think Company meant to say they're writing a renderer, not a driver
12:51 DemiMarie: karolherbst: does it really take a full rewrite?
12:52 mclasen: but a user who cares about fast booting without a gpu could set it
12:52 karolherbst: not a full one
12:52 karolherbst: but like instead of having one rendering context for all displays, you need to be more dynamic
12:52 zamundaaa[m]: amdgpu does support recovering from GPU resets, though it's not completely 100% reliable
12:52 DemiMarie: zamundaaa: would you mind explaining further?
12:52 karolherbst: and that can cause quite significant reworks
12:52 karolherbst: and then decide where something should be rendered
12:52 zamundaaa[m]: Demi: in some situations, recovery just fails for some reason
12:52 jadahl: mclasen: sure. it's unfortunate this seems to be needed :(
12:52 zamundaaa[m]: I don't know the exact details
12:53 DemiMarie: I think the hard solution is the best option
12:53 mclasen: jadahl: yeah, after all these years, booting is still a problem :(
12:53 DemiMarie: At least if there is enough resources to do it.
12:53 karolherbst: but yeah.. the PCIe situation with eGPUs is even worse, because you usually don't have a x8/x16 connection, but x4
12:54 Company: random data: I lose ~10% performance by moving my GTK benchmark to the monitor that is connected to the iGPU
12:55 Company: which makes no sense because the screen updates only 60x per second anyway, not the 2000x that the benchmark updates itself
12:55 Company: but it drops from 2050fps to 1850fps
12:56 Company: probably overhead because the dGPU has to copy the frame to the iGPU
12:56 karolherbst: if rendering on the discrete GPU?
12:56 Company: yeah, GTK stays on the dGPU
12:56 karolherbst: but yeah.. if you have more load on the PCIe bus, command submission will also be slower probably
12:56 Company: it's just Mutter having to shuffle the buffer from the dGPU to the iGPU
12:57 Company: we have ~150k data per frame, at 2000fps that's 300MB/s
12:57 karolherbst: I've played around with changing the PCIe bus speed in nouveau when I did the reclocking work. On desktops none of that mattered much, single digit perf gains at most, but on a laptop it was absolutely brutal how much faster things went
12:57 Company: actually, probably more because that's just vertex buffer + texture data, not the commands
12:57 Company: this is on a desktop
12:58 karolherbst: oh I mean desktop as in single GPU
12:58 Company: Radeon 6500 dGPU and whatever is in the Ryzen 5700G as the iGPU
12:59 Company: but the speeds for reading data from the dGPU are slooooow anyway
12:59 DemiMarie: Even when using DMA?
12:59 karolherbst: it apparently also matters enough that OSes add features so that a dGPU can claim an entire display for itself, so no round trips whatsoever happen
12:59 karolherbst: DMA is still using PCIe
13:00 karolherbst: you can only push soo much data over PCIe
13:00 DemiMarie: Can't one usually push quite a few buffers?
13:00 Company: DemiMarie: I worked on dmabuf-to-cpu-memory stuff recently and it takes ~200ms to copy an 8kx8k image from the GPU
13:00 karolherbst: PCIe 4.0 x16 is like 32GiB/s
13:00 Company: note: *from* the GPU, not *to* the GPU
13:00 karolherbst: VRAM can be like.. 1TiB/s
13:01 DemiMarie: Company: 8K is way out of scope for now
13:01 Company: yeah, but that's still a lot less than I expected
13:01 DemiMarie: For my work at least
13:01 Company: PCIe does 32GB/s, this is more like 1GB/s
13:01 DemiMarie: Seems like a driver bug or hardware bug worth reporting.
13:02 Company: and a 30x difference in speeds is noticable
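[Rough arithmetic behind those numbers, assuming a 4-byte-per-pixel format for the 8k x 8k readback:]

    \[ \frac{8192 \times 8192 \times 4\ \mathrm{B}}{0.2\ \mathrm{s}} \approx \frac{268\ \mathrm{MB}}{0.2\ \mathrm{s}} \approx 1.3\ \mathrm{GB/s} \]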
13:02 karolherbst: but yeah.. if somebody wants to experiment with splitting compositing between all the GPUs/displays and make apps not have to round-trip to the iGPU, that would be a super interesting experiment to see
13:02 DemiMarie: I mean I want to round-trip via shm buffers initially, because it makes the validation logic so much simpler.
13:02 Company: karolherbst: before any of that, I need to support switching GPUs in GTK ;)
13:03 karolherbst: :')
13:03 DemiMarie: But that is because I am starting with software rendering as the baseline.
13:03 karolherbst: could be a new protocol where the compositor tells clients to change renderers, or where it simply causes the GL context to go "context_lost" or something, but yeah....
13:04 karolherbst: I'd really be interested in anybody investigating this area
13:04 Company: the linux-dmabuf tranches tell you which GPU to prefer, no?
13:04 Company: I mean, ideally, with a supporting compositor
13:05 karolherbst: I mean as in dynamically switching
13:05 YaLTeR[m]: cosmic-comp does the split rendering from what i understand (each GPU renders the outputs it's presenting), though i believe GPU selection happens via separate Wayland sockets where a given GPU is advertised
13:05 karolherbst: like maybe you disconnect your AC and the compositor forces all apps to go to the iGPU
13:05 Company: karolherbst: I'd expect to get new tranches
13:05 karolherbst: YaLTeR[m]: ohh interesting, I should check that out
13:06 Company: YaLTeR[m]: the problem with that is that I suspect the dGPU is still faster for that output, so just using the GPU that the monitor is connected to might not be what's best
13:06 Company: also: people connect their monitors to the wrong GPU all the time
13:07 Company: there's lots of reddit posts about that
13:07 karolherbst: yeah.. but the pcie round-trip overhead could be worse
13:07 YaLTeR[m]: it's actually not that hard to do with smithay infra (in general render with an arbitrary GPU). But it makes doing custom rendering stuff somewhat annoying
13:07 YaLTeR[m]: Company: a random half of the usb-C ports on my laptop connect to the igpu and the other half to the dgpu
13:08 YaLTeR[m]: certainly makes it convenient to test multi GPU issues :p
13:08 karolherbst: oh yeah.. my laptop is USB-C -> iGPU all normal connectors -> dGPU
13:08 Company: yay
13:08 karolherbst: there are apparently also laptops where you can flip it
13:08 Company: I only have my setup for testing
13:09 karolherbst: and then there are laptops which have eDP on both GPUs and you can move the internal display to the other GPU
13:09 Company: like what i did 10 minutes ago
13:09 karolherbst: at runtime
13:09 YaLTeR[m]: I have that too
13:09 YaLTeR[m]: Not at runtime tho I don't think
13:09 karolherbst: you can even make the transition look almost seamless if you use PSR while the transition is happening
13:10 Company: karolherbst: fwiw, changing GPUs in GTK would not be too hard to implement (at least with Vulkan) - but I've never seen a need for it
13:10 karolherbst: I know that some people were interested in getting the eDP GPU switch working
13:10 karolherbst: yeah.. it might not matter much for gtk apps. Maybe more for apps that also use GL/VK themselves for heavy rendering
13:11 karolherbst: and then AC -> move to dGPU, disconnect AC -> move to iGPU
13:12 Company: (there are GTK apps that do heavy rendering)
13:12 karolherbst: but my setup is already cursed and the dual 4K setup makes the iGPU suffer heavily
13:12 karolherbst: and apps just ain't at 60fps all the time
13:12 Company: that can easily be the app
13:13 karolherbst: (or gnome-shell even)
13:13 Company: because software rendering at 4k gets to its limits
13:13 karolherbst: but yeah... dual 4K is heavy
13:13 linkmauve: vnd, Weston’s current version is 14, so 8 or 10 are very out of date and likely unsupported, you probably should upgrade that first. I don’t know if the current version supports plane offloading better on your SoC though.
13:13 Company: plus, software rendering has to fight the app for CPU time
13:13 karolherbst: sure, but it's hardware rendering here
13:14 Company: hardware rendering at 4k is fine - at least for GTK apps
13:14 karolherbst: also on a small intel GPU?
13:15 Eighth_Doctor: karolherbst: the framework's all-usb-c connections make testing USB-C to GPU pretty easy
13:15 Company: it should be
13:15 karolherbst: yeah... but it isn't always here :)
13:15 Company: not sure how small though
13:15 Eighth_Doctor: and oh my god it's so damn hard to find a good dock that works reliably
13:15 karolherbst: well it's not terrible
13:15 karolherbst: but definitely not smooth
13:15 Eighth_Doctor: a friend of mine and I went through 4 docks from different vendors and none of them worked as advertised because of different quirks with each one
13:16 karolherbst: but a lot of it is also gnome-shell, and sometimes also just GPU/CPU not clocking up quickly enough
13:16 Eighth_Doctor: I just want a dock that works 😭
13:16 karolherbst: as they starve each other
13:16 karolherbst: but I suspect that's a different issue and not necessarily only perf related
13:16 Company: probably
13:17 Company: my Tigerlake at 4k gets around 600fps - so I'd expect an older GPU to halve that and a more demanding GTK app to halve that again
13:17 nerdopolis: I think with the simpledrm case it is somewhat harder in some ways as it's not a GPU reset, but /dev/dri/card0 just completely goes away instead
13:17 karolherbst: I think it's just all things together here
13:18 Company: if you then do full redraws on 2 monitors with it, you get close to 60fps
13:18 karolherbst: e.g. the gnome window overview puts the GPU at like 60% load, but is still not smooth
13:18 Company: I learned recently that that's usually too many flushes
13:18 karolherbst: but no idea what's going on there, nor did I check, nor do I think it's related to where things are rendered. Though I can imagine a laptop on AC doing it on the dGPU could speed things up
13:18 karolherbst: yeah.. could be
13:19 Company: solution: use Vulkan, there you can just never flush and have even more lag!
13:19 karolherbst: anyway.. I think it totally makes sense to experiment more with how this all works on dual GPU setups. The question is just how much does it actually matter
13:19 karolherbst: heh
13:20 karolherbst: like if a laptop could move entirely to the dGPU, including the desktop and all displays, it could make the experience smoother on insane setups (imagine like 8K displays)
13:20 karolherbst: and also driving the internal display via the dGPU
13:20 Company: the problem with that is that you guys screwed up the APIs so much
13:20 karolherbst: heh
13:20 zamundaaa[m]: karolherbst: I'm sometimes using an eGPU connected to a 5120x1440@120 display. Without triple buffering, the experience was *terrible*
13:21 Company: that app devs don't want to touch multi-gpu
13:21 karolherbst: right...
13:21 Company: Vulkan is much nicer there
13:21 karolherbst: yeah with GL it's a mess
13:21 Company: but everyone but me seems stuck on GL
13:21 karolherbst: that's why I was wondering if a wayland protocol could help here and the compositor signals via it what GPU to use
13:22 karolherbst: and the apps only get "you recreate your rendering context now and don't care about the details"
13:22 karolherbst: and then it magically uses the other GPU
13:22 Company: what people really want is the GLContext magically doing the right thing
13:22 karolherbst: yeah.. that's somewhat how it works on macos for like 15 years already
13:23 karolherbst: they get an event telling them to recreate their rendering stuff
13:23 karolherbst: and that's basically it
13:23 Company: that's somewhat complicated though, because that needs fbconfig negotiation and all that
13:23 karolherbst: (which also means they have to requery capabilities, and because of that and other reasons it's opt-in)
13:24 karolherbst: or maybe it's opt-out now, dunno
13:24 Company: too much work for too little benefit I think
13:24 karolherbst: probably
13:25 karolherbst: but as I said, if somebody wants to experiment with all of this and comes around with "look, this makes everything sooper dooper smooth, and games run at 20% more fps" that would certainly be a data point
13:25 Company: it would
13:25 Company: and I bet it would only work on 1 piece of hardware
13:25 Company: and a different laptop, probably from the same vendor, would get 20% slower with the same code
13:25 karolherbst: maybe
13:26 karolherbst: maybe it doesn't matter much at all
13:26 Company: I think it does matter on some setups
13:26 karolherbst: and then eGPUs never became a huge thing and dual GPU laptops are also icky enough that a lot of people avoid them
13:26 Company: because you want to go to the dgpu when not on battery but stay on the igpu on battery
13:26 karolherbst: yeah
13:27 Company: first I need to make gnome-maps use the GPU for rendering the map
13:27 Company: so I have something that can hammer the GPU
13:28 Company: then I'll look at switching between different ones
13:28 jadahl: karolherbst: there is already a 'main device' event that allows the compositor to communicate what gpu to use for non-scanout
13:28 Company: I don't think HDR conversions are enough
13:29 karolherbst: jadahl: that's for startup only or also dynamically at runtime?
13:30 jadahl: its dynamic
13:30 karolherbst: ahh, I see
13:31 Company: https://wayland.app/protocols/linux-dmabuf-v1#zwp_linux_dmabuf_feedback_v1:event:main_device
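[For reference, a minimal client-side sketch of reacting to that event with wayland-client; the generated header name and the "switch renderer" step are assumptions, and real code must install handlers for all feedback events:]

    #include <string.h>
    #include <sys/types.h>
    #include "linux-dmabuf-v1-client-protocol.h"  /* wayland-scanner output; name assumed */

    /* Called whenever the compositor (re)announces the preferred main device.
     * If the dev_t differs from the device currently used for rendering, the
     * client can tear down its EGL/Vulkan device and recreate it on the new one. */
    static void handle_main_device(void *data,
            struct zwp_linux_dmabuf_feedback_v1 *feedback,
            struct wl_array *device)
    {
        dev_t dev;
        memcpy(&dev, device->data, sizeof(dev));
        /* compare against the current render device and switch if it changed
         * (application-specific) */
    }

    static const struct zwp_linux_dmabuf_feedback_v1_listener feedback_listener = {
        .main_device = handle_main_device,
        /* .format_table, .tranche_target_device, .tranche_formats,
         * .tranche_flags, .tranche_done and .done must also be set in real code */
    };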
13:34 zamundaaa[m]: karolherbst: https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/268
13:35 zamundaaa[m]: In some configurations, especially with eGPUs, the difference can be far larger than 20%
13:36 karolherbst: yeah, I can imagine
13:38 nerdopolis: Compositors might have to be changed to support the possibility of the primary GPU going away, correct?
13:40 kennylevinsen: there isn't a concept of a primary GPU on the system, but applications that depend on a GPU - the display server included - need to do something to handle it going away
13:47 nerdopolis: I'm still more thinking in the case of simpledrm going away I guess. during the transition when simpledrm goes away and the new GPU device initializes...
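[That transition is visible to a display server as udev events on the "drm" subsystem: a "remove" for the simpledrm card node, then an "add" for the real one. A minimal standalone sketch with libudev, just printing the events a compositor would react to:]

    #include <libudev.h>
    #include <poll.h>
    #include <stdio.h>

    int main(void)
    {
        struct udev *udev = udev_new();
        struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");

        /* only DRM devices (/dev/dri/card*, /dev/dri/renderD*) */
        udev_monitor_filter_add_match_subsystem_devtype(mon, "drm", NULL);
        udev_monitor_enable_receiving(mon);

        struct pollfd pfd = { .fd = udev_monitor_get_fd(mon), .events = POLLIN };
        for (;;) {
            if (poll(&pfd, 1, -1) <= 0)
                continue;
            struct udev_device *dev = udev_monitor_receive_device(mon);
            if (!dev)
                continue;
            const char *action = udev_device_get_action(dev);  /* "add", "remove", "change" */
            const char *node = udev_device_get_devnode(dev);    /* e.g. /dev/dri/card1 */
            if (action && node)
                printf("%s %s\n", action, node);
            udev_device_unref(dev);
        }
    }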
13:49 Company: usually there's 2 things involved: 1. bringing up the new GPU, and adapting to the changes in behavior (ie it may or may not support certain features) and 2. figuring out what to do with the data stored in the GPU's VRAM
13:52 nerdopolis: I think at least things using simpledrm should all be using software rendering, correct?
13:56 jadahl: nerdopolis: yes. i guess in theory one could have an accelerated render node with no display controller part, where one renders with acceleration, but displays it via simpledrm, but that is probably not a very common setup
14:02 nerdopolis: Probably not, simpledrm is only there if the real driver hasn't loaded yet, OR they have some obscure card that is not supported by the kernel at all. I mean I guess I never tested the bootvga driver being unsupported, with the secondary device having a valid mode setting driver...
14:18 emersion: it can happen
14:18 emersion: e.g. nouveau doesn't have support for a newer card
14:18 emersion: (yet)
14:18 emersion: in general, any case where kernel is old and hw is new
14:20 nerdopolis: Ah, that makes sense
14:33 MrCooper: geez, you leave for a couple of hours of grocery shopping, and this channel explodes :)
14:35 MrCooper: Company: we get a fair number of bug reports about AMD GPU hangs against Xwayland (and probably other innocent projects), thanks to radeonsi questionably killing the process after a GPU reset if there's a non-robust GL context
14:35 MrCooper: Company DemiMarie: AMD GPU resets generally work fine, the issue is that most user space can't survive GPU resets yet
14:36 Company: I have no idea how you'd want to handle GPU resets in general
14:36 DemiMarie: Company: Recreate all GPU-side state.
14:36 Company: like, you'd need to guarantee there's no critical data on the GPU
14:37 MrCooper: Company: mutter should copy buffer data across PCIe only once per output frame, not per client frame
14:37 DemiMarie: Exactly
14:37 MrCooper: Company: in a nutshell, throw away the GPU context and create a new one
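[In GL that usually means creating a robust context and polling the reset status; a minimal sketch assuming EGL_EXT_create_context_robustness and GL_KHR_robustness are available:]

    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <GLES2/gl2.h>
    #include <GLES2/gl2ext.h>

    /* context creation: ask for a robust context that reports resets */
    static const EGLint robust_ctx_attribs[] = {
        EGL_CONTEXT_CLIENT_VERSION, 3,
        EGL_CONTEXT_OPENGL_ROBUST_ACCESS_EXT, EGL_TRUE,
        EGL_CONTEXT_OPENGL_RESET_NOTIFICATION_STRATEGY_EXT,
        EGL_LOSE_CONTEXT_ON_RESET_EXT,
        EGL_NONE,
    };
    /* ctx = eglCreateContext(dpy, config, EGL_NO_CONTEXT, robust_ctx_attribs); */

    /* per frame (or after a failed draw): did the GPU reset under us? */
    static int context_was_reset(void)
    {
        static PFNGLGETGRAPHICSRESETSTATUSKHRPROC get_status;
        if (!get_status)
            get_status = (PFNGLGETGRAPHICSRESETSTATUSKHRPROC)
                eglGetProcAddress("glGetGraphicsResetStatusKHR");

        /* any non-GL_NO_ERROR status (guilty/innocent/unknown reset) means the
         * context and every GL object in it are gone: destroy the context,
         * create a fresh one, and re-render everything from CPU-side state */
        return get_status && get_status() != GL_NO_ERROR;
    }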
14:37 DemiMarie: Having critical data on the GPU is a misdesign.
14:38 Company: is it?
14:38 kennylevinsen: Apps will have to rerender and submit new frames, compositor will need to rerender and have windows be black until it gets new frames...
14:38 Company: so, let's assume I have a drawing app - do I need to replicate the drawing on the CPU anticipating a reset?
14:38 DemiMarie: Company: no, you just rerender everything
14:38 Company: or can I do the drawing on the GPU until the user saves their document?
14:39 Company: I mean, for a compositor that's easy - you just tell all the apps to send you a new buffer
14:39 Company: because there's no critical data on the GPU
14:40 DemiMarie: Company: Don't drawing apps typically keep some state beyond the bitmap?
14:40 MrCooper: except there's no mechanism for that yet
14:40 Company: but if you have part of the application's document on the GPU?
14:40 kennylevinsen: even for an app I would expect that the state that drove rendering exists in system memory to allow a later rerender
14:40 DemiMarie: That would be a misdesign
14:40 DemiMarie: kennylevinsen: exactly
14:40 Company: kennylevinsen: especially with compute becoming more common, I'd expect that to not be the case
14:40 zamundaaa[m]: MrCooper: when a GPU reset happens, apps that are GPU accelerated will know on their own to reallocate
14:41 zamundaaa[m]: Company: in some cases, data could get lost, yes
14:41 DemiMarie: Company: then rerun the compute job
14:41 kennylevinsen: DemiMarie: I don't think it's appropriate to call it a misdesign per se, there could be uses where the caveat of state being lost on GPU reset is acceptable
14:41 zamundaaa[m]: Just like with the application or PC crashing for any other reason, apps should do regular saving / backups
14:41 kennylevinsen: I just don't expect that to generally be the case
14:41 MrCooper: zamundaaa[m]: the app needs to actively handle it, the vast majority don't
14:41 DemiMarie: Generally, you should preserve the inputs of what went into the computation until the output is safely in CPU memory or on disk
14:41 kennylevinsen: "safely in CPU memory" heh
14:41 DemiMarie: Which is a bug in most apps
14:42 zamundaaa[m]: MrCooper: yes, but in that case, requesting a new buffer is useless anyways
14:42 DemiMarie: kennylevinsen: you get what I mean
14:42 MrCooper: that's a separate issue
14:42 DemiMarie: So yes, GTK should be able to recover from GPU resets.
14:42 zamundaaa[m]: how so? If the app handles the GPU reset, it can just submit a new buffer to the compositor after recovering from one
14:43 MrCooper: zamundaaa[m]: if the compositor recovers from the reset after a client, it might not be able to use the last attached client buffer anymore, in which case it would need to ask for a new one
14:43 zamundaaa[m]: The only reason some compositors need to request new buffers from apps is that they release wl_shm buffers after uploading the content to the GPU
14:44 zamundaaa[m]: MrCooper: right, if the compositor wants to explicitly avoid using possibly tainted buffers
14:44 zamundaaa[m]: or rather, buffers with garbage content
14:45 MrCooper: that's not the issue, it's not being able to access the contents anymore
14:45 zamundaaa[m]: MrCooper: it can access the contents just fine
14:45 MrCooper: not sure it's really needed though, keeping around the dma-buf fds might be enough
14:46 zamundaaa[m]: Not the original one, if the buffer is from before the GPU reset, but it can still read from the buffers and get something as the result
14:46 DemiMarie: zamundaaa: is that true on all GPUs?
14:46 DemiMarie: I would not be surprised if that just caused another GPU fault.
14:47 Company: my main problem is that GTK wants to use immutable objects that do not change once created - and a GPU reset changes those objects
14:47 Company: so now you need a method to recover from immutable objects mutating
14:48 Company: which is kinda like wl_buffer
14:48 DemiMarie: Company: Can you make each object transparently recreate its own GPU state?
14:48 Company: which is suddenly no longer immutable either because the GPU just decided it's bad now
14:48 DemiMarie: Or recreate everything from the API objects?
14:49 Company: DemiMarie: not if it's a GL texture object
14:49 Company: and no idea about dmabuf texture objects
14:49 DemiMarie: Company: those are not immutable
14:49 Company: and even if I could recreate them, they'd suddenly have new sync points
14:49 DemiMarie: anything that is on the GPU is mutable
14:49 kennylevinsen: the world would be much nicer if GPUs didn't reset :)
14:50 Company: DemiMarie: those are immutable in GTK per API contract - just like wl_buffers
14:50 DemiMarie: Company: that seems like an API bug then
14:50 llyyr: you dont need to deal with gpu resets if you don't do hardware acceleration
14:50 DemiMarie: Apps need to recreate GPU buffers if needed
14:50 psykose: you don't need to deal with software if you don't have hardware yea
14:51 DemiMarie: kennylevinsen: I believe Intel GPUs come close. They guarantee that a fault will not affect non-faulting contexts unless there are kernel driver or firmware bugs.
14:51 zamundaaa[m]: Demi: I *think* that it's a guarantee drivers with robust memory access have to make
14:51 Company: DemiMarie: that's the question - you can decide that things are mutable, but then suddenly everything becomes mutable and you have a huge amount of code to write
14:52 DemiMarie: Company: that seems like the price of hardware acceleration to me
14:52 kennylevinsen: DemiMarie: I imagine the cause of resets is generally such bugs, so not sure how helpful that guarantee is
14:52 kennylevinsen: but amdgpu does have above-average reset occurrence
14:52 Company: DemiMarie: same thing about mmap() - the kernel could just mutate your memory and send you a signal so you need to recreate it - why not?
14:52 DemiMarie: kennylevinsen: On many GPUs userspace bugs can bring down the whole GPU.
14:53 DemiMarie: Company: because CPUs provide proper software fault containment
14:53 Company: DemiMarie: I don't think that's a useful design though - I think a useful design is one where the kernel doesn't randomly fuck with memory
14:53 Company: DemiMarie: so make it happen on the GPU
14:54 DemiMarie: Company: Complain to the driver writers and hardware vendors, not me.
14:54 Company: I am
14:54 DemiMarie: Via which channels.
14:54 DemiMarie: ?
14:55 Company: but I think it's fine if I just write my code assuming those things can't happen and wait for hardware to fix their stuff
14:55 kennylevinsen: Company: currently, resets happen Quite Often™ on consumer hardware
14:55 Company: instead of designing an overly complex API working around that misdesign
14:55 kennylevinsen: so you probably have to expect them for now
14:55 DemiMarie: I can say that on some GPUs, you may be able to get that guarantee at a performance penalty, because no more than one context will be able to use the GPU at a time.
14:56 Company: kennylevinsen: not really - people complain way more about other things
14:56 DemiMarie: kennylevinsen: how often do you see them on Intel?
14:58 Company: there's also this tendency of the lower layer developers to just punt all their errors to the higher layers and then blame those devs for not handling them
14:58 Company: which is also not helpful
14:58 kennylevinsen: there is indeed an issue with hardware issues getting pushed to software, but we tend to get stuck dealing with the hardware we got as our users have it
14:58 Company: "the application should just handle it" is a very good excuse
14:59 Company: my favorite example of that is still malloc()
15:00 kennylevinsen: DemiMarie: I'm not a hardware reliability database - anecdotally, I have only seen a few i915 resets, but have had periods on amdgpu where opening chrome or vscode would cause a reset within 10-30 minutes, which was painful before sway handled resets
15:00 DemiMarie: Company: GPU hardware makes fault containment much harder than CPU hardware does.
15:00 DemiMarie: kennylevinsen: that makes sense
15:01 Company: on Intel, a DEVICE_LOST because of my Vulkan skills doesn't reset the whole GPU
15:01 Company: on AMD, a DEVICE_LOST makes me reboot
15:01 DemiMarie: Company: that's what I expect
15:01 kennylevinsen: here, DEVICE_LOST just causes apps that don't handle context resets to exit
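[On the Vulkan side the equivalent signal is VK_ERROR_DEVICE_LOST from a submission; a minimal sketch, where recreate_renderer() is a hypothetical helper that rebuilds all device-level state:]

    #include <vulkan/vulkan.h>

    /* hypothetical: destroys and recreates VkDevice, queues, swapchain,
     * allocations, pipelines, ... from CPU-side state */
    void recreate_renderer(void);

    void submit_frame(VkDevice device, VkQueue queue,
                      const VkSubmitInfo *submit, VkFence fence)
    {
        VkResult res = vkQueueSubmit(queue, 1, submit, fence);
        if (res == VK_ERROR_DEVICE_LOST) {
            /* every object created from the lost device is now invalid:
             * tear it all down, rebuild, then re-record and resubmit */
            vkDestroyDevice(device, NULL);
            recreate_renderer();
        }
    }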
15:02 zamundaaa[m]: kennylevinsen: about "amdgpu does have above-average reset occurrence", not so fun fact: amdgpu GPU resets are currently the third most common crash reason we get reported for plasmashell
15:02 kennylevinsen: but to amd's credit, resets have reduced and they also appear to resort to context loss less often?
15:03 kennylevinsen: zamundaaa[m]: dang
15:03 Company: dunno, I write code that doesn't lose devices
15:03 Company: I don't want to reboot ;)
15:03 zamundaaa[m]: kennylevinsen: in my experience, GPU resets are at least recovered from correctly lately
15:03 zamundaaa[m]: While plasmashell may crash, KWin recovers, and some other apps do as well
15:04 DemiMarie: zamundaaa: what are the first two?
15:04 kennylevinsen: zamundaaa[m]: it was a huge user experience improvement when sway grew support for handling context loss
15:04 zamundaaa[m]: Xwayland's the bigger problem
15:04 zamundaaa[m]: Demi: something Neon specific, and something X11 specific
15:05 kennylevinsen: hmm yeah, losing xwayland is more jarring even if relaunched
15:05 DemiMarie: Company: in theory, I agree that GPUs should be more robust. In practice, GTK should deal with device loss if it doesn't want bad UX.
15:05 zamundaaa[m]: kennylevinsen: it's worse. Sometimes KWin hangs in some xcb function when Xwayland kicks the bucket
15:05 DemiMarie: zamundaaa: Neon?
15:06 kennylevinsen: oof
15:06 Company: DemiMarie: I think it's not important enough for me to care about - and I expect it to get less important over time
15:06 zamundaaa[m]: Demi: KDE Neon had too old Pipewire, which had some bug or something. I don't know the whole story, but it should be solved as users migrate to the next update
15:06 Company: DemiMarie: but if someone wants to write patches improving things - go ahead
15:09 Company: same thing with malloc() btw - Gnome still aborts on malloc failure just like it did 20 years ago. I'm sure that could be improved but nobody has bothered yet
15:10 DemiMarie: zamundaaa: AMD is working on process isolation, which will hopefully make things better, but it will be off by default unless distros decide otherwise.
15:11 Company: AMD turns everything off that may make fps go down
15:13 nerdopolis: I feel like the case where the main driver is slow to load, and the login manager greeter display server uses simpledrm, and then gets stuck in limbo when the kernel kicks it out is starting to be more common too
15:14 nerdopolis: kwin handles it the best because of kwin_wayland_wrapper, (but only Qt applications so far)
15:15 nerdopolis: other display servers like Weston hang when I boot with modprobe.blacklist=virtio_gpu so they start with simpledrm, and then modprobe virtio-gpu
15:16 kennylevinsen: amdgpu being as slow to load as it is should also really be fixed...
15:19 jadahl: kennylevinsen: I'd also like a generic "i'm gonna start trying to load a gfx driver now" signalling so one can conditionalize waiting in userspace on that
15:20 jadahl: but it seems non-trivial to make such a thing possible
15:23 MrCooper: Company: malloc never fails, for a very good approximation of "never"
15:24 DemiMarie: kennylevinsen: what makes it so slow?
15:24 kennylevinsen: you'd have to ask amdgpu devs that question
15:24 kennylevinsen: firmware loading perhaps?
15:25 Company: MrCooper: that took a while though - 25 years ago when glib started aborting, malloc() did fail way more
15:26 MrCooper: DemiMarie: one big issue ATM is that the amdgpu kernel module is humongous, so just loading it and applying code relocations takes a long time
15:27 MrCooper: Company: I'll have to take your word for it, can't remember ever seeing it fail in the 25 years I've been using Linux
15:27 MrCooper: of course one can make it fail by disabling overcommit, that likely results in a very bad experience though
15:28 DemiMarie: MrCooper: why is the kernel module so large? Is it because of the huge amount of copy-and-pasted code between versions?
15:28 Company: MrCooper: when I worked on GStreamer in the early 2000s, I saw that happen sometimes
15:28 Company: also because multimedia back then took lots of memory
15:28 MrCooper: DemiMarie: mostly because it supports a huge variety of HW
15:29 Company: *lots of memory relative to system memory
15:30 DemiMarie: Also I wonder if in some cases the initramfs will only have simpledrm, with the hardware-specific drivers only available once the root filesystem loads.
15:38 nerdopolis: I think there is one distro that actually does that with the initramfs, I could be wrong
15:40 nerdopolis: DemiMarie: And I think it does it if the volume is not encrypted or something, so it's not all installs. I think its ubuntu, but don't quote me on that
15:49 nerdopolis: Yeah it is Ubuntu that sometimes doesn't load modesetting drivers in the initrd https://github.com/systemd/systemd/issues/3259
17:42 wlb: wayland-protocols Merge request !342 opened by () governance: introduce workflow improvements https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/342 [governance], [In 30 day discussion period]