00:02soreau: mareko: AMD_DEBUG=nodcc makes the bug worse, (many more frames are transparent textures) while AMD_DEBUG=notiling makes the bug not happen at all
00:03soreau: almost have enough to file a bug report but the way to reproduce it is kinda going out on a limb
00:04soreau: but no2d makes it worse too, huh
00:17soreau: AMD_DEBUG=w32ps makes it not happen as well
00:49soreau: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9830
04:34lina: DemiMarie: The AGX hardware is pretty "normal" as far as security goes. It doesn't do the nvidia thing of exposing MMIO directly to userspace, so you have to go through the kernel/firmware, but it still has context separation and all that, and it can actually run multiple jobs from different contexts at once securely. Most of the dumb things I run into are bad design decisions in Apple's drivers and
04:34lina: firmware... like, I think the cross-context DoS aspect is mostly just the firmware not being designed correctly, not that the hardware is broken like that.
04:36DemiMarie: lina: Makes sense. I was quite surprised that this was only DoS; “buffer control” sounds rather important!
04:36lina: Oh that one is at least somewhat worse than DoS at least to some extent, but my driver isn't vulnerable to that.
04:36DemiMarie: nice :)
04:36DemiMarie: might be able to get another bounty from Apple if you could exploit ti
04:36DemiMarie: it
04:37lina: That one I found gets accessed through privileged context #0 always, so now I just double-map it in context #0 low memory and firmware memory, with the right permissions.
04:37DemiMarie: nice
04:37lina: I actually told them about that one, they implemented a (worse than mine) workaround in 13.0, then reverted it for some reason...
04:37lina: I think they just don't care of "mostly DoS level" problems right now
04:37DemiMarie: reverted it? that sounds fishy!
04:38lina: Their workaround was mapping it in context #0 user area and actually making the firmware use that, which involved the firmware TTBR-switching into #0 to mess with it, which I guess has a performance penalty since it's a uPPL call. Or maybe it interacted badly with the other stuff they use user contexts for.
04:38lina: I suspect they just didn't think of the double-mapping solution.
04:39lina: I also told them to just outright unmap the whole firmware are from user contexts, and I used to do that, but sadly with G14X the GPU started requiring *read* access to command buffers since I suspect they're now using hardware register sequencing.
04:40lina: So now I map those read-only there, which is a bit sad but not a major infoleak (it's mostly just dumb render settings in there, the most useful info you can get out of it is framebuffer dimensions)
04:40lina: I still need to investigate whether safer workaround for that one are possible, they might well be
04:40DemiMarie: Wow!
04:41lina: A lot of this is "how much can I diverge from what macOS does and still make it work"
04:41DemiMarie: You are definitely better at writing drivers than they are.
04:44soreau: Can anyone using radeonsi on wayland run the command at the bottom of the issue and tell me if it reproduces a flickering problem? https://gitlab.freedesktop.org/mesa/mesa/-/issues/9830
04:44lina: I try ^^
04:45lina: The main DoS thing I can't fix is that if jobs are running in parallel, a fault/timeout in one often kills the others. But that's a common problem in GPU land, sadly...
04:46DemiMarie: Unfortunately so…
04:46lina: However, I can still investigate stuff there. Like, for example, there may be a way to retry jobs that are idempotent, and a lot of "simple" render jobs are idempotent (anything that starts with a framebuffer clear), so that could harden things like common compositor jobs from misbehaving apps.
04:47DemiMarie: There’s other ways not to be idempotent, like image load/store.
04:47lina: yes
04:47DemiMarie: Userspace could do that automatically, though.
04:47lina: There is actually one kind of unknown job flag where the logic we have/reverse engineered is suspiciously like "is idempotent", and I wonder if it affects firmware job recovery...
04:48lina: Job retries? No, that has to happen in firmware/kernel, because by the time userspace submits a job and gets out fences those signal successful completion of the job to the next part, there's no going back unless you break the signaling look in userspace which sucks for performance.
04:48DemiMarie: I would not be surprised at all.
04:48lina: Job faults are a stop-the-world situation, so the kernel is in a perfect spot to mess with things as long as the firmware likes what it ends up with. I just need to understand what we can do.
04:49lina: Right now we just clear the faulted job IDs and the firmware action seems to be to shunt them to complete (we mark them as failed).
04:49lina: But maybe there are ways to ask for a retry, or to mess with in-memory state to do something else.
04:49airlied: I think there a lot of reasons mentioned why kernel retries are bad plan
04:49DemiMarie: How bad is serializing all jobs such that only one executes at a time?
04:49lina: airlied: I think the issue was retries for a culprit job are bad; retries for a victim job that is idempotent aren't.
04:50airlied: that seems like it would be bad just from a throughput pov
04:50lina: Why would it affect throughput? In the normal case the codepath never gets called.
04:50airlied: lina: but you can't really tell idempotent with a shader analysis can you?
04:50lina: This only happens on faults.
04:50airlied: if you serialised all jobs
04:50airlied: always
04:50airlied: without a shader analysis
04:50lina: Oh sorry, misread
04:50lina: Yes
04:51lina: airlied: We own the shader compiler, we know what the shaders do... If there are any stores it's not idempotent.
04:51airlied: if a job has landed on the hw at all, and you it dies, resubmitting it is generally considered a bad strategy
04:51airlied: what about hw counters etc?
04:51airlied: depth buffer interactions
04:51lina: This is a job-based architecture, everything is a big load/store around jobs
04:51lina: There is no state
04:52DemiMarie: I wonder why faults wind up killing other jobs.
04:52lina: The only issue is how many tiles were partially read/written from the framebuffers, but that's why this only works if we either have separate load/store pointers (which is totally possible and would make sense for a robust mode), or if everything starts with a clear.
04:52lina: DemiMarie: I think the firmware is just dumb and does a full reset when it faults.
04:52DemiMarie: lina: ouch
04:53airlied: lina: does it include vertex and compute shaders?
04:53lina: Yes, though compute will of course do arbitrary load/store so unless you have higher level knowledge that's harder.
04:54lina: Vertex and fragment have a very well defined pipeline in simple cases.
04:54lina: In the end this would be (already is if my guess is correct) a UAPI flag, so the driver has to know.
04:54airlied: occlusion queries just seem like they could get miscounted, but maybe that won't matter
04:54airlied: or at least pipeline stats might get screwed up
04:54lina: I forget if those are cleared before everything, they probably are though?
04:55lina: I mean you're rerunning the whole job
04:55DemiMarie: does anyone care about pipeline stats except for debugging?
04:55DemiMarie: also something with stores can be idempotent if none of the stores alias any of the loads
04:55lina: Stats/perfctrs of course would change, but we don't even support those right now and... really, who cares, when the alternative is outright killing the job?
04:55lina: I mean nobody thinks killing the compositor when your app faults is a good idea, so our options right now are either just do nothing and deal with the resulting screen corruption/glitching from an aborted compositor job, or try to make it less bad...
04:56airlied: does OSX kill the compositor?
04:56lina: No, OSX glitches.
04:56lina: We've seen a bunch of that experimenting...
04:56airlied: so someone thinks it is a good idea :-)
04:57airlied: granted we'd have to make out compostiors deal with device loss
04:57lina: Actually I don't remember if I've seen the compositor glitch, but I've definitely seen other apps glitch.
04:57DemiMarie: How bad would it be to have the compositor jobs never run in parallel with other jobs?
04:57lina: It's actually possible the existing idempotent-suspect flag already works like this
04:57lina: I haven't tested it
04:57lina: If macOS always runs jobs like that with their compositor it'd be immune.
04:58lina: DemiMarie: I'm not sure we can sanely mutex those like this. It's a firmware scheduler, if we try to micromanage it it might as well not be there and we lose performance...
04:58lina: Of course there's still tons of unknowns and features that it might have that we might not know about.
04:59lina: So really the answer to a lot of these things is just to dfo more experiments and reverse engineer more of the firmware directly ^^
04:59lina: There's only so much I can do by observing the way macOS does things
04:59DemiMarie: True
05:00DemiMarie: Hopefully you don’t wind up having to ship firmware ROP chains in production!
05:00lina: I can check whether that flag is an "idempotent" flag by forcing a job kill in my python testbed and then seeing if the firmware reruns on recovery (I can observe job states over time with some polling stuff)
05:00lina: Stuff like that
05:00lina: Those experiments are really hard to run in Linux but thankfully we have m1n1 ^^
05:01lina: DemiMarie: Yeah, though I'm eyeing at least the SGX write microops... I already tried and succeeded in using them to increase the bindless sampler count from 1024 to 2048 (really dumb firmware limitation, I think Apple actually fixed it in 14.0 but we might as well support it everywhere)
05:02lina: Those are handy to get the hardware to do things the firmware doesn't support ^^
05:04DemiMarie: lina: I wonder why the firmware crashes when there are too many faults.
05:04DemiMarie: I hope it isn’t overflowing a fixed size log buffer or something bad like that…
05:04lina: Yeah... that one, I don't know if that's my fault or a race in the firmware or what. It's really tricky to debug. I don't think I've ever seen it in macOS but macOS also has very different allocation patterns to my driver, so it doesn't mean it's my bug... but it might well be.
05:05lina: Last I tried to dive into it, at least some cases looked like a command type confusion. But that could well be a caching issue that's my fault.
05:05lina: Caching has been consistently a major pain...
05:05lina: It's not a buffer overflow or anything like that.
05:05DemiMarie: That’s good
05:06lina: The crashes are often asserts where stamps are noncontiguous and things like that.
05:06lina: It's a state problem.
05:06lina: Others I've seen were definitely type confusion leading to a bad access, like it was trying to parse a command as a different type and dereferencing pointers.
05:07lina: Which suggests somewhere along the way command indexes or types got confused somewhere in some array.
05:07lina: The good thing is that faults are stop-the-world and do a full cache flush and the kernel *can* just directly write firmware .data, so if it comes to that we could just "fix" whatever is wrong in firmware memory, though it would be ugly...
05:08lina: But maybe this is all something I missed, or maybe we can fix it by only messing with shared data structures which is much cleaner.
05:09lina: Come to think of it, since faults are stop-the-world, I could try to snapshot RAM after every fault and get it to crash and then try to replay that state in my fancy javascript emulator...
05:09lina: That might get me somewhere, if the snapshotting overhead doesn't mask the heisenbug...
05:12DemiMarie: I wonder if the firmware is forgetting to disable interrupts during some non-atomic operation.
05:14lina: Yes, it could well be a race like that
05:15lina: Something like the fault handler/timeout IRQ gets invoked during a bad point
05:16DemiMarie: Anyway it’s bedtime here
05:16lina: Good night! ^^
05:16DemiMarie: Goodbye, and nice talking to you!
05:19mediaim: That's a weird content this cow reports DemiMarie, I know how aggressive thinkers some are though, like airlied, I doubt they have anything against to modernize and simplify the stack a bit, mmiotrace as a tool is also quite easy to rewrite to performing version.
05:21HdkR: If +M is missing we need more moderators in more timezones.
05:22HdkR: I volunteer for tribute since my polyphasic sleep cycle means I'm usually available.
05:25psykose: i usually don't care much for trolling because it's at least a little funny but this spam is just brain death after months of it
05:25psykose: this person is such a loser
05:27Sachiel: +M doesn't help in this case either
05:28HdkR: +M just adds a layer of annoyance which slows things down
05:28HdkR: Which can be helpful to reduce the number of interactions at least
05:29HdkR: Does akick work for non-registered users even?
05:29Sachiel: he is registered
05:29Sachiel: he always is, that's why I say +M doesn't help
05:29ccr: registering is "too easy", he always creates new users
05:30HdkR: Guess they started doing that once everyone slapped +M on the channels
05:31ccr: not a problem that could be trivially solved, unfortunately.
05:31Sachiel: nah, he always has
05:31Sachiel: he's been doing it for years
05:31ccr: closer to ten years, I'd say .. at least about as long as I've been idling on x/freedesktop related channels
05:33ccr: hmm .. that might've sounded a bit suspicious :D
05:33psykose: one of those rare situations one really wishes for that extrajudicial hit squad
05:34ccr: :/
05:34psykose: ah well
05:34psykose: it is what it is :)
06:05HdkR: No need to take the discussion that low. Going down to that level makes a fool out of everyone.
06:07dottedmag: psykose: Please feel sorry for them, they definitely have lost some gears, not their fault (most likely). I've seen people gradually descending into this kind of hell, and it's not fun.
06:45tzimmermann: javierm, could this be fallout from your FB_CORE patches: https://lore.kernel.org/lkml/202309130632.LS04CPWu-lkp@intel.com/ ?
07:13MrCooper: psykose: it's a person with mental health issues, not a loser
07:13javierm: tzimmermann: hmm, I guess it could be. Arnd also sent a fix, maybe is the same? https://lists.freedesktop.org/archives/dri-devel/2023-September/422320.html
07:16tzimmermann: javierm, i'd say that arnd 's patch is different
07:17tzimmermann: the sh4 driver needs FB=y
07:19arnd: is there anything I need to change in my patch to get that applied? I have not seen the other patch and don't know if there are still any open issues with mine
07:19arnd: it should get the defaults back to exactly what they were in 6.5
07:21arnd: ah, you mean rdd's reply, yes his patch is fine. I think what happened there is that it was broken already but the bot saw a slightly different error message after the recent changes
07:22arnd: before tzimmermann's patch, it had a link failure against a number of symbols, now it also needs fb_io_read/fb_io_write on top of those
07:23tzimmermann: arnd, feel free to ignore my comment on your patchset. it was more like a 'what if'.
07:42javierm: tzimmermann: sha4 driver requires FB=y ? But even before the FB_CORE split FB was a tristate symbol
07:43arnd: javierm: sh has endless randconfig build failures, the bot just has a database so it can ignore the known ones, but that only works as long as the error message has the same output
07:44javierm: arnd: ah, I see. Now understand your previous message about slightly different error message
07:47tzimmermann: javierm, it's FB_SH7760, which is a bool https://lore.kernel.org/lkml/feadd6a5-0f56-4575-9891-3a7d88e69e64@infradead.org/
07:47tzimmermann: not the SH_MOBILE
07:47jannau: tzimmermann: PPC dts doesn't seem to use "power-domains" properties, at least not in tree. so ofdrm is good as it is
07:48tzimmermann: jannau, thanks. i'll merge your patch then
07:48tzimmermann: picngme if I forget about is
07:48tzimmermann: it
07:48tzimmermann: 'ping'
07:48jannau: sure. thanks
07:50javierm: tzimmermann: right, typo but still my point stands. FB used to be tristate anyways so if FB_SH7760 needs FB to be built-in, then should depend on FB=y
07:50javierm: not depend on FB
07:54tzimmermann: javierm, i think this is fallout from my patch 5f86367006c6 ("fbdev/sh7760fb: Use fbdev I/O helpers")
07:55tzimmermann: fb_io_read() used to be part of fb_read(). but now it is set by FB_IOMEM_HELPERS
07:56tzimmermann: i cannot really tell why this makes a difference, but apparently it does
07:57tzimmermann: hmm, but that doesn't explain the other missing symbols, such as fb_alloc_cmap()
07:57tzimmermann: no idea than
07:57tzimmermann: then
08:19javierm: tzimmermann: let me fetch that config and see if I can reproduce it
08:39arnd: javier tzimmermann:I'm very sure it was always broken, just send a fix and don't spend too much time researching it
08:39tzimmermann: ok
08:40arnd: tzimmermann: I saw your latest cleanups for drivers/video split out the logo stuff. I was going to propose the same thing but your patch is nicer than the one I made anyway, so I'm really happy with that
08:41arnd: I also tried to untangle modedb.c as I noticed that almost all of it is only needed with CONFIG_FB=y but not with pure DRM/FB_CORE
08:41arnd: I spent way too much time on that but just couldn't figure it out so far
08:42tzimmermann: thanks
08:43arnd: I have a simple patch that adds an #ifdef CONFIG_FB around everything but the middle block (fb_var_to_videomode, fb_videomode_to_var, fb_mode_is_equal, fb_find_best_mode, fb_find_nearest_mode, fb_match_mode, fb_add_videomode, fb_delete_videomode and fb_destroy_modelist)
08:43tzimmermann: i'll look at modedb if i find time
08:43arnd: ok, great
08:44javierm: tzimmermann: do you want to post a fix making it depends on FB=y ?
08:44tzimmermann: javierm, i'm on it
08:44arnd: I'm fairly sure the remaining modedb functions I left there are not actually needed since there is no point maintaining the linked list when there is only ever one entry in it and you can't change modes from it, but it is all tangled up in fb_set_var()
08:45javierm: tzimmermann: I tried to reproduce locally but I need a superh cross-compiler since the driver depends on platform data in <asm/sh7760fb.h>
08:45tzimmermann: i've been interested in merging drm's and fbdev's modetables and cmdline code in a single place within video/.
08:45tzimmermann: i just haven't figured out how to sell that to either drm nor fbdev devs :P
08:46javierm: tzimmermann, arnd: didn't find a sh cross-compiler already packaged in fedora so gave up :)
08:46arnd: tzimmermann: my feeling was that the fbdev mode lists are a lost cause and ti would be better to just keep them away from the drm/fb_core side
08:46tzimmermann: javierm, yeah, some of these drivers are almost impossible to build
08:46tzimmermann: arnd, i'll keep that in mind
08:48arnd: tzimmermann: in particular, the mode setting through the sysfs code looks actively evil and should probably just be removed. I think there are no users because adding a modelist that way seems to add userspace pointers into kernel structures, and that crashes when you try to read it out again, at least from another process
08:48tzimmermann: javierm, for of them (ep93xx IRCC) i made it config_cross_compile and copied some headers into x86 arch. then i could partially build it for one of my patches
08:49javierm: tzimmermann: urgh
08:50tzimmermann: it was good enough to build-test that patch's change
08:50javierm: tzimmermann: yeah, I'm just thinking if pulling <asm/sh7760fb.h> would need to get more headers due transitive dependency
08:51javierm: anyways, you are on it so I leave that in your capable hands :D
08:51tzimmermann: arnd, i'm thinking of tables of displaymodes. they seem to be duplicated. but maybe it's not worth the effort anyway
08:55arnd: tzimmermann: my feeling is that the "fb_videomode modedb[]" array is too tangled up with the legacy drivers that have their own tables in platform data, so you can't easily change the code using it, and making drm use the fb table would be a step backwards for drm
09:03sima: yeah I don't think there's a benefit in trying to merge fbdev and drm mode handling, just a lot of pains
09:04sima: what might make sense is just radically cutting it if there's only drm fbdev enabled
09:04arnd: tzimmermann: something that would probably help is to try moving fb_set_var() into a separate file that is only used by fb_device or the legacy drivers, the tricky bit here is changing the callers in fbcon (fbcon_set_disp, fbcon_resize, fbcon_switch, fbcon_blank) to do a lower level operation on the fb device, at least for the drm case
09:06tzimmermann: the cost-benefit ratio might not be worth it
09:09javierm: tzimmermann, sima, arnd: with efforts like https://gitlab.freedesktop.org/plymouth/plymouth/-/merge_requests/224 and https://lists.freedesktop.org/archives/dri-devel/2023-September/422912.html
09:09sima: for fairly radical nuking making fb_info->modelist optional and then dumping all related code, with fbcon just copying over the current fbdev mode, might work
09:09javierm: hopefully we should be able to get rid of fbcon at some point
09:09sima: but only for the case where only drm is enabled as sole driver
09:10sima: yeah ideally all we keep support is the fbdev uapi on top of drm
09:10sima: because the entire fbcon locking is fairly fundamentally busted
09:12tzimmermann: drm's connector comes with a parser for the kernel's video= parameter. I once looked into moving that code into video/. we could then nuke fbdev's parsing code at https://elixir.bootlin.com/linux/latest/source/drivers/video/fbdev/core/modedb.c#L622
09:12sima: I wouldn't worry about that one either tbh
09:12tzimmermann: but i didn't fully check if both are fully compatible.
09:13tzimmermann: sima, the reason for me to consider it, would be that it would result in a single parser for that complicated option
09:14arnd: what would be the replacement for fbcon? my feeling was that /dev/fb is the one we should get rid of, not the console
09:14tzimmermann: but i never spent any time on it
09:14sima: tzimmermann, well if you make fbinfo->modelist optional with a Kconfig and then go around adding #ifdef until it compiles
09:14sima: you should get there too
09:14sima: without any risk of breaking legacy fbdev drivers
09:14tzimmermann: arnd, /dev/fb is already optional
09:15sima: arnd, /dev/fb has some apps that use it, and it's uapi
09:15arnd: tzimmermann: I know, they are both optional, the question is just which one you consider more important
09:15sima: plus it alone is fairly reasonable wrt locking/lifetime fun
09:15sima: arnd, fbcon otoh is terminally screwed up and would need a full rewrite if you want to get rid of the cve potential
09:15sima: and if you rewrite the thing, might as well implement it mostly in userspace
09:15tzimmermann: fbcon would be replaced by a simple display output for system panics and a console in userspace
09:16tzimmermann: arnd, search dri-devel for 'drm_panic'
09:16arnd: sima: my impression from /sys/class/graphics/fb0/modes is that this is also an instance cvs
09:16arnd: cvs
09:16arnd: cve
09:16tzimmermann: there's been an RFC patchset on it
09:16sima: arnd, yeah, that's why I suggested the fbinfo->modelist Kconfig
09:17sima: I think if you do that, you'd nuke the modes sysfs, the duplicated parser, and a lot of other fun stuff
09:17sima: but only works if the sole fbdev driver is drm, since that has a fixed mode :-)
09:18sima: maybe with some functions shuffled and some wrappers added it wouldn't even be that many #ifdef all over the code
09:18arnd: the problem I ran into there is setting the mode through the tty path, which then tries to look up the current mode in the list. there is probably a way to replace that, but that is the part I meant above with reworking fb_set_var()
09:18tzimmermann: from my distro POV, i'd rather just disable fbdev entirely than make it more fine-grainedly configurable
09:19sima: arnd, for drm you can just keep the current one from info->var and be done
09:19arnd: having an option to set FB_DEVICE=m might also be interesting, that way a distro can still build the module but not load it by default
09:19sima: instead of trying to match some mode
09:19sima: arnd, also if drm ever gets multiple true fbdev mode support, the proper approach would be to just wire the fbdev mode stuff through to drm
09:19sima: instead of trying to maintain 2 mode lists
09:20arnd: sima: right, that is what I had hoped, I just couldn't prove that this is the case. my initiali reading was that there is a case where you set the mode through /dev/dri/*, which then calls into the fbcon code and from there into fbmem.c, and that would fail
09:21sima: arnd, last lines in do_register_framebuffer() for what happens if you only have one mode, we essentially just put the mode from info->var into the modelist
09:21sima: arnd, nah nothing flows from dri back to fbdev, at all
09:21sima: the one fbdev mode is picked up when we register the fbdev, and fairly fake (at least for multi screen with different outputs)
09:21arnd: what happens if you set the mode on the console through dri?
09:22sima: you cant
09:22arnd: ok
09:22arnd: what about setting it through console ioctl?
09:22sima: or well you can, but the moment the console gets back into control it'll restore the drm-fbdev-emulation mode
09:22sima: it'll change the sw size, but physically nothing changes
09:22sima: fbcon just renders into a smaller part of the real framebuffer
09:23sima: arnd, drm fbdev emulation is a subordinate drm display owner (called drm master because)
09:23arnd: in that case, we can probably just blank out all the fbcon functions that call fb_set_var() and return an error from them when CONFIG_FB is disabled, and at that point eliminate all of modedb.c and fb_set_var()
09:23sima: so the moment you have a real drm master, drm-fbdev and so fbcon is not in control at all
09:23sima: arnd, ah no, because we do support change some things, just not the actual physical mode
09:24arnd: signing off for today, have to catch a flight
09:24sima: but we should be able to nuke the modelist related stuff, from a quick look nuking everything you hit when you make fbinfo->modelist optional and replacing the fbcon mode picker with the logic in register_framebuffer that just picks info->var should work
09:25sima: arnd, ttyl, and good travelling!
09:26sima: arnd, also if anyone notices the sysfs regression, implementing something non-broken for the single-mode-only case shouldn't be much code
09:27sima: and if do add drm support for picking actual physical modes then we could add fbops->fb_match_mode and ->fb_list_modes or something like that
09:27sima: that would be a bit more work, but the #ifdef will at least find everything that needs to be touched
09:36jfalempe: tzimmermann, I'm looking into reusing some fbdev drawing functions for drm_panic.
09:36jfalempe: one possible candidate is this one https://elixir.bootlin.com/linux/latest/source/drivers/video/fbdev/core/cfbimgblt.c#L215
09:37tzimmermann: jfalempe, thanks
09:37jfalempe: but that's really hard to make it independent from fbdev structs and api.
09:38tzimmermann: i'd start by trying to get rid of the fb_info. that's an fbdev framebuffer
09:38javierm: jfalempe: I wonder though if wouldn't be better to merge a minimal drm panic handler, even if only supports XRGB8888 and then as follow-up try to do that?
09:39javierm: otherwise I fear that it will take much more time to land something if untagling that is a requirement
09:39javierm: tzimmermann: ^
09:40jfalempe: Yes, that's also another approach, I think fbdev code is not ready to be used by others.
09:40tzimmermann: javierm, it would be nice to have at least an idea where to go with drm_panic before merging it
09:40tzimmermann: maybe have a todo item with some sort of roadmap
09:40jfalempe: one other thing would be to merge the cfb_xxx.c with sys_xxx.c using the iosysmap api.
09:40tzimmermann: jafalempe, i'd not want to do that
09:41tzimmermann: jfalempe ^
09:41jfalempe: drm_panic uses iosysmap, so it would be better to have the lower level handle that too.
09:42tzimmermann: jfalempe, the inner loops would have plenty 'if (is_iomem) else'.
09:42tzimmermann: not nice
09:42tzimmermann: and it requires a large refactoring of these fbdev routines
09:43javierm: tzimmermann: fair. A TODO makes sense, even mentioning that is experimental and depends on EXPERT
09:43javierm: I just think that is easier if is in the tree instead of having a patch-set inflight that takes a lot of time to land and needs constant rebasing
09:44jfalempe: Yes, that would also be easier for early adopter of drm_panic, to showcase the feature.
09:44javierm: or even depends on !VT
09:46jfalempe: drm_panic currently depends on !FRAMEBUFFER_CONSOLE, because they kind of conflicts. not sure about !VT.
09:46tzimmermann: jfalempe, javierm, give me a bit to look through the cfb_ and sys_ functions. i would assume that fucntions like fast_imageblit and slow_imageblit could be stripped from their fb_info dependencies
09:48jfalempe: tzimmermann, one other issue, is they don't handle drm format, they only use bit per pixels, so they can't handle XRGB8888 and XBGR8888
09:50sima: jfalempe, imo copypasting drawing code is entirely fine ...
09:50sima: also if you want to share code to make things work beyond xrgb8888 maybe the drm format conversion stuff might be useable?
09:51sima: my experience with fbdev code is that if you touch it, it tends to go boom in really entertaining ways
09:51jfalempe: Yes, I can probably reuse some of that too.
09:51sima: so unless we have a really solid reason to refactor it big time, just leave it as-is
09:52jfalempe: sima, I'm not really eager to touch fbdev code ;), but still wanted to take a look, as it was suggested in reviews.
09:58sima: jfalempe, I think if you do want common code, then blt functions on top of iosys_map probably makes the most sense
09:58sima: but that's not how fbdev code works at all
09:58sima: so personally I wouldn't bother
10:01jfalempe: sima, ok, I think I will just move that code to its own file (ie drm_panic.c for the panic handler and something like drm_draw.c for the drawing part).
10:35consolers: how to fix compilation of freeglut with demos=on and egl=on ?
10:35consolers: make the demos depend on epoxy?
10:49tzimmermann: javierm, jfalempe, on the format issues: we could draw in XRGB8888 unconditionally and convert via DRM's format helpers to the surface's output format
10:50tzimmermann: it's probably the easiest way forward
10:52tzimmermann: we'd need to preallocate (mode width × 4 × font height) bytes to draw glyphs in a line. then convert into the surface region
10:52tzimmermann: then do the next line of output
10:52jfalempe: tzimmermann, ok thanks, yes that's also a possibility. I'm also thinking on how to avoid having to allocate a whole framebuffer.
10:53tzimmermann: you only need a line line of text. fullhd with fontheight of 16 is (mode width × 4 × font height) = 1920×4×16 = 122880
10:54tzimmermann: 30 pages
10:54tzimmermann: and we already have a good number of conversion helpers available
10:55tzimmermann: to preallocate memory, there's drm_mode_config.max_width as a limit
10:56jfalempe: tzimmermann, thanks I'm now looking into this, it sounds much easier and feasible than fbdev.
10:59jfalempe: maybe a special case for monochrome, as the fonts are already in monochrome, no need to convert it twice.
11:02javierm: tzimmermann, jfalempe: that's a simpler approach indeed
11:06tzimmermann: there's one thing: these helpers use kmalloc. is that still available in the case of a kernel panic?
11:19pq: You might not even need a full mode width for a tmp buffer, it could be some number of glyphs and do a screen line in a few pieces. Might fit better in CPU caches too?
11:19pq: if you use the conversion helpers made for VKMS, you might want to render glyphs into the internal format directly
11:20pq: IIRC that's 16 bpc
11:26jfalempe: tzimmermann, you can still use kmalloc, but it's risky.
12:40llyyr: !24677 "radeonsi/vcn: Fix leaking fences in decode" causes issues when seeking in mpv with "--vo=gpu-next --hwdec=vaapi --video-sync=display-resample --gpu-api=vulkan". For h264 videos, incorrect frames are displayed and for h265 videos multiple frames are blended together
12:40llyyr: is this likely to be an issue with that specific mesa commit or mpv?
12:42llyyr: In particular, removing "if (fence != dec->prev_fence) return true;" from the commit fixes the issue
13:32gpiccoli: o/ tzimmermann sorry for the ping! Loong time ago we talked about potentially preserving framebuffer data, in order to reuse it in case of panic, for example. Here is the reference thread: https://lore.kernel.org/dri-devel/6d3c7acf-a23f-3073-56ed-375ccb8cc815@suse.de/
13:33gpiccoli: Also , if you/others are aware of any advance in that area (graphics output during panic), I'd appreciate any pointers.
13:34tzimmermann: gpiccoli, we talked about it today.
13:34gpiccoli: I can't believe! What a coincidence!
13:35tzimmermann: haha
13:35tzimmermann: see jfalempe's recent patchset at https://lore.kernel.org/dri-devel/0a6f869c-3c0c-8c7c-ca09-b885d9b5bafe@redhat.com/T/#t
13:35tzimmermann: it's still RFC, but things are moving
13:35tzimmermann: today's irc logs are here: https://dri.freedesktop.org/~cbrill/dri-log/index.php?channel=dri-devel&date=2023-09-18
13:36gpiccoli: that's great to hear! and this seems to be a panic handler, but nothing related with the FB saving data right?
13:36gpiccoli: I can read the logs here as well, I'm using a bouncer and can check the logs - thanks for the link!
13:37jfalempe: gpiccoli, yes this is only to display something useful for end user when a panic occurs.
13:37tzimmermann: gpiccoli, yes. the idea is that DRM install a panic handler. the handler fill ask the DRM driver to return whatever the hardware's latest scanout buffer is.
13:37gpiccoli: jfalempe, tnx a bunch for the effort! This is quite a welcome change IMO
13:37tzimmermann: s/fill/will
13:38gpiccoli: dpanic_line logo[] <- awesome!
13:38jfalempe: gpiccoli, it works only with simpledrm, but I hope to have other drivers working too.
13:38gpiccoli: so, you're using panic notfiers uh? "struct notifier_block drm_panic_notifier"
13:39gpiccoli: nice use, and I guess I'll mention that next week in my presentation if you're OK with that
13:39gpiccoli: I'll be talking exactly about these notifiers in Recipes, and the refactor we're trying to propose (and some other complexities on panic, like .. graphics!)
13:40jfalempe: yes, that's a really easy interface, you register a callback, and it's called when a panic occurs.
13:40gpiccoli: This last past ofc I don't understand much, that's the reason I asked here to see if there was some change since I (unfortunately) don't follow much the gfx lists...
13:40gpiccoli: exactly jfalempe , super easy...but I'd say... way too easy haha
13:41gpiccoli: there is no track of who registers what, the risks of the callbacks, etc. Also, there's a inherent problem in ordering the callbacks and their order vs kdump
13:41jfalempe: gpiccoli, feel free to mention it in your presentation, it's all public anyway.
13:41gpiccoli: "it works only with simpledrm" <- so if I'm using amdgpu, for example, even with oyour patches I'll see nothing in a panic?
13:42gpiccoli: simpledrm is a big step already, and tnx for that. Just curious to understand the current impl heh
13:42jfalempe: gpiccoli, yes, you should blacklist amdgpu, so it will use simpledrm instead.
13:43jfalempe: also during early boot, the kernel uses simpledrm before loading amdgpu, so if your kernel panic at this moment, you will see it.
13:43gpiccoli: ok, cool - will read the patches more carefully, but thanks a bunch for this work, panic graphics seems an area that deserves more love
13:43gpiccoli: oh yeah, or efifb right? Or not anymore?
13:44jfalempe: it depends on distribution, but the future is simpledrm.
13:46javierm: gpiccoli: for efifb you will need to make fbdev drivers to implement a .get_scanout_buffer callback
13:47javierm: but those are just platform drivers... so adding that to dev->driver would be more tricky
13:47zmike: MrCooper: feels like this might be a question for you: is there a way to block scanning a certain pci device in xorg init? or do I have to explicitly select the devices to use
13:47gpiccoli: oh sure! but I said is like..during early boot, some cases I saw efifb handling the display
13:47zmike: I have some garbage onboard vga that gets detected and disables modifier support in my xserver
13:47gpiccoli: but it's nice to see that simpledrm is the future
13:48javierm: gpiccoli: yeah, as jfalempe said depends on the distribution. OpenSUSE and Fedora moved to simpledrm already, others distros are still using {efi,vesa,simple}fb
13:49gpiccoli: interesting! tnx for the clarification folks, very useful for me!
13:49javierm: gpiccoli: for context, simpledrm just reuses whatever framebuffer is passed by the firmware/bootloader to the kernel
13:50javierm: gpiccoli: which means that supports the same than efifb, vesafb, simplefb, etc
13:50javierm: one generic system framebuffer driver to rule them all :)
13:52gpiccoli: that's really great! Lemme take the opportunity to pick y'all brain since we're talking a about that. How difficult seems to be to..kinda of having simpledrm "hijack" the display back in panic, even if the drm driver in-use was like amdgpu or i915?
13:52MrCooper: zmike: good luck, I stopped caring about Xorg a while ago
13:53zmike: 😬
13:53jfalempe: gpiccoli, I think it would be easier for i915 or amdgpu to implement the get_scanout_buffer() function.
13:53gpiccoli: so, imagine all distros move to simpledrm , so early boot is always handled by it. Then amdgpu takes over...but in panic time, having amdgou working is a pain, really a pain! So, getting back to simpledrm "saved state" would be a perfect solution
13:54gpiccoli: jfalempe, the thing is: these devices require really a lot to work, lots of interrupts handling, and panic..disables CPUs, preempt, interrupt
13:54javierm: gpiccoli: simpledrm taking back could be tricky, what I wonder though is if you can add that support to kdump
13:54gpiccoli: we'd need something really simple to avoid creating a big risk in this already delicate moment that panic is
13:54javierm: that is, kexec a new kernel and pass the existing framebuffer to the new kernel
13:54gpiccoli: javierm, I agree, exactly!
13:55javierm: then from the new kernel simpledrm, there's a system framebuffer that's provided by the "firmware" but that's the previous kernel that panics
13:55gpiccoli: than, on kdump, simpledrm is there up and running, light than any other big GPU driver as Intel's/AMD's
13:56jfalempe: yes, but you need to access the hardware to setup the framebuffer mapping. so only the amdgpu driver can do that
13:56jfalempe: on boot it's the uefi/bios firmware that prepare that.
13:56javierm: jfalempe: yeah, agreed. The native driver would have to do that preparation
13:57gpiccoli: yeah..seems tough
13:57jfalempe: after you load amdgpu. simpledrm cannot work anymore.
13:57gpiccoli: in that same thread I mentioned, it was discussed with AlexD about that, for amdgpu seems complicated to reprogram the device...
13:58jfalempe: I'm curious to see how they do that on other OS, so there is probably a way.
13:59javierm: gpiccoli: as a first test what you could do is to boot using simpledrm but no native driver (e.g: nomodeset in the cmdline), set kdump and panic
14:00javierm: gpiccoli: if you make struct screen_info to survice the kexec or the old kernel to pass that info to the new kernel somehow, then it may work with only simpledrm in 1st kernel and simple in 2nd
14:01gpiccoli: yeah, this work! I already done some tests like this
14:01gpiccoli: but IIRC, with efifb..I can have graphics in kdump
14:01gpiccoli: the thing is: I don't feel real-world users (desktop / gamers) would use simpledrm in the first kernel...maybe servers (which is already something quite nice). Hence my question..
14:02gpiccoli: if we could have a way to "forget" that users were running complex drivers and restore everything simply (using simpledrm!) in kdump, that would be wonderful
14:02javierm: gpiccoli: right, EFI would work because the EFI-GOP information is maintained in an EFI table during kexec
14:02gpiccoli: but as I say that, I can even realize how wishful thinking this seems to be heheh
14:02tzimmermann: gpiccoli, for simpledrm to work, you need hardware with a programmed display mode
14:02tzimmermann: amdgpu would provide that
14:02gpiccoli: oh javierm , good point! So simpledrm test worth a try
14:03javierm: gpiccoli: yeah because in EFI is using the same info
14:03tzimmermann: but simpledrm is a real driver. you'd have to go through the whole driver probing. that's not feasible during a panic, i guess
14:03javierm: I guess that DT platforms would also work since I assume that kdump pass the existing DTB to the new kernel?
14:04javierm: tzimmermann: yeah, I don't think is feasible to do it without kdump/kexec
14:04gpiccoli: tzimmermann, amdgpu would provide means...already provides or would *need* to provide? hehe
14:04gpiccoli: yeah javierm , agreed about DT - it makes sense
14:04tzimmermann: amdgpu already programs the mode
14:04gpiccoli: recently there's been lots of talks to preserve memory across kexec/kdump...
14:05javierm: tzimmermann: it programs yeah, but it would need to re-program to the system provided framebuffer format in order to work, right ?
14:05tzimmermann: gpicolli, with the get_scanout_buffer callback, you can get the current settings
14:05gpiccoli: oh cool tzimmermann - I tought amdgpu leaves the device in a over-complex "state" for simple scanout to work
14:05tzimmermann: and then operate on those
14:05tzimmermann: gpiccoli, there's no guarentee tha tit always works
14:06tzimmermann: but get_scanout_buffer will be implemented by amdgpu. so you could attempt to make it work in as many cases as possible
14:06gpiccoli: yeah, I understand! and not to mention..amdgpu is responsible for handling huge amount of devices, so ...maybe work in APUs and not dGPUS or vice-versa
14:06gpiccoli: great to know tzimmermann !
14:06tzimmermann: for example, on panic, amdgpu could try to program something simple to the hardware
14:06gpiccoli: this get_scanout_buffer is some new API or just regular/old one (and I'm not aware 'cause I'm far from gfx world?)
14:06tzimmermann: it's easier said than done, but still...
14:07javierm: gpiccoli: new API
14:07gpiccoli: oh yeah, panic time I guess is a no-go
14:07gpiccoli: cool! so is there a definitive patchset that I should read to understand that?
14:07javierm: tzimmermann: I guess that there are two options 1) re-program the display using the same info that was provided by the firmware (so that re-using the EFI-GOP would just work)
14:07tzimmermann: it's a single new callback that the DRM driver provides
14:08javierm: 2) Make kdump to query the info from get_scanout_buffer() and pass that to the new kernel for simpledrm to pick that up
14:08javierm: it could even be a video= cmdline param like bootloaders pass to the x86 vesa arch code
14:08tzimmermann: javierm, i'd try to return whatever is currently programmed. if that's not possible, i'd try to program something simple
14:09javierm: tzimmermann: yeah, makes sense. The question is a) how to pass to the new kernel and b) how to tell the sysfb platform code that needs to query in a different place
14:10tzimmermann: javier, gpicolli, i had some ideas specifically for kexec/kdump. get_scanout_buffer would return the current display config, which kexec would store away for the kexec kernel. the architecture's bootcode in the kexec kernel detects this and sets screen_info accordingly
14:10tzimmermann: and within the kexec kernel, simpledrm would use whatever it find in screen_info
14:11tzimmermann: my question has always been were to store the framebuffer config during for the kexec kernel
14:11javierm: tzimmermann: yup, that's what I had in mind too. Hence my comment about struct screen_info surviving a kexec
14:12javierm: but I'm not that familiar with kexec to know if that would be robust or fragile
14:12javierm: it's more safe to attempt to re-program a HW during panic or to attempt to store something in memory for the new kernel to use? That's the question I guess
14:12tzimmermann: i'm not familiar enough with kexec to answer that question (yet)
14:12javierm: tzimmermann: hehe, loved the yet :D
14:13tzimmermann: i'd say it's better to store whatever is already there
14:13tzimmermann: reprograming during panic seems very fragile to me
14:14tzimmermann: within get_scanout_buffer, the DRM driver knows if the current display mode is legitimate. but if it isn't, get_scanout_buffer fails and the kexec display remains of
14:14tzimmermann: 'off'
14:15tzimmermann: that's at least a save state
14:15tzimmermann: later within the kexec kernel, the native DRM driver would re-inint the hardware and the display comes back
14:16javierm: tzimmermann, gpiccoli: I think we need this regardless since I've seen in the past issues with kdump where display only worked when booting with nomodeset
14:16tzimmermann: javierm, TBH, when I proposed get_scanout_buffer for drm_panic, i was already eyeing towards kexec/kdump usecases
14:16javierm: tzimmermann: cool
14:34gpiccoli: sorry, was out for some mins
14:34pq: zmike, foremost I'd look in BIOS settings to disable that on-board VGA. Next I'd probably try to assign it to a non-default seat with udev rules.
14:34gpiccoli: "it's more safe to attempt to re-program a HW during panic or to attempt to store something in memory for the new kernel to use?" <- javierm , I'd say easily the latter, way safer! Since you preload the crash kernel , it's not in a bad moment as panic
14:35gpiccoli: "I think we need this regardless" <- also agree! That'd be awesome
14:36gpiccoli: I'll study a bit the get_scanout stuff, all new to me
14:37gpiccoli: Also, I'm always to discuss kdump/kexec/panic, though as I said, unfortunately I'm not gfx expert! I'll be watching this work and if you could CC me it's even better hehe
14:37gpiccoli: Quite interested on that, I can test as well...I have amdgpu APU (Steam Deck) and a dGPU, and of course, qemu heh
14:37zmike: pq: sadly the onboard can't be disabled in bios
14:37zmike: udev rules could be interesting though, I'll see what that yields
14:38gpiccoli: zmike, is it the same driver as the other gpu ?
14:38zmike: no
14:38gpiccoli: I mean, the onboard one?
14:38zmike: all my other vga devices are amd and this is aspeed
14:38gpiccoli: can't you balcklist aspeed?
14:39gpiccoli: *blacklist
14:39zmike: dunno
14:39gpiccoli: also, another alternative : https://gitlab.com/driverctl/driverctl
14:39gpiccoli: using this, you could "force" this device to bind to pci-stub
14:39gpiccoli: which...does nothing and leaves the device there, not messing with the others hehe
14:39zmike: hm
14:40gpiccoli: I'd try the module_blacklist cmdline first as it seems cheaper/quick to test
14:40pq: ID_SEAT=idontwanna with whatever the right syntax is in udev rules.conf.
14:43zmike: gpiccoli: that worked
14:43zmike: amazing
14:43gpiccoli: great zmike ! You mean, the module_blist or driverctl ?
14:43zmike: driverctl
14:43gpiccoli: I found this tool fantastic when was working with VMs
14:44zmike: it's very easy to use
14:44gpiccoli: Because you can pre-bind all devices you want to PCI-PT to vfio/pci-stub, not risking a bind to the real driver, that might init the device to a state that is hard to "un-init" heh
14:44gpiccoli: indeed! quite user friendly
15:25soreau: I can't run weston-simple-egl with zink because it segfaults with this trace: http://pastie.org/p/7nRGpjiSyIpH8YGZ8C1QBb/raw
15:28zmike: need more MRs applied
15:29soreau: zmike: is it supposed to work or has it ever worked? this is the first I've tried zink with this app
15:33zmike: it used to work
15:33zmike: and there's open MRs that make it work again
15:36soreau: gitlab search doesn't seem to work very well
15:40soreau: google is much better.. *tries MR 24700*
16:02soreau: zmike: I rebaed your egl branch on main, built it and weston-simple-egl runs but if you try to resize it, segfaults with this trace: http://pastie.org/p/6JbddPJMNtDvJWssjJfBmF/raw (same on the egl branch without rebasing)
16:04soreau: oh hm
16:05zmike: works fine here
16:05soreau: it's picking the wrong libvulkan_radeon.so
16:05soreau: even though I have LD_LIBRARY_PATH set
16:15soreau: had to set VK_ICD_FILENAMES, but it still crashes with this http://pastie.org/p/6tqXQok1L7hJhIMv7mCQa9/raw
16:45sima: jfalempe, preallocating a line and then using format helpers also has my vote as probably the most reasonable approach
16:46sima: jfalempe, you would need to make sure the line buffer is allocated at driver register time thought and probably quite a bit more than just for full hd
16:46sima: since you cannot allocate anything in panic context
16:48sima: jfalempe, since fbcon must pin an entire framebuffer just in case it should still be a win in memory consumption if you disable fbdev entirely
16:49sima: so I wouldn't worry about allocating a pretty wide line buffer
16:59jfalempe: sima, ok, I'm already rewriting the draw primitive to draw line by line.
17:45dcbaker: karolherbst: I reviewed your rust-meson patch.
17:58karolherbst: cool, thanks
18:06Lynne: apparently nvidia no longer ship an ada-only gsp firmware
18:07Lynne: they seem to be loading in the ampere firmware instead (which did work before, with some complaints)
18:07karolherbst: it should still work
19:15soreau: when I run e.g. weston-simple-egl with zink, it's like vblank_mode=0 at ~3500fps, while radeonsi runs it at the screen refresh rate
19:22soreau: with glxgears on zink, it's also the same as screen refresh
19:23soreau: I guess it's time for a 'zink is too fast' issue report
19:29soreau: zmike: FWIW, this fixes the resize crash with weston-simple-egl for me http://ix.io/4GKs
19:30soreau: I noticed that when it crashed, wsi_wl_surface_create_swapchain() was being called before wsi_wl_swapchain_queue_present() around the time is created a new swapchain, and it was using the old swapchain somehow
19:31soreau: after setting old_chain->wsi_wl_surface = NULL
20:03karolherbst: dcbaker: how can I run a single test?
20:04dcbaker: in meosn? ./run_single_test --use-tmpdir --quick test\ cases/rust/1\ basic
20:04dcbaker: (or whatever the path to your test is)
20:04dcbaker: karolherbst: ^
20:05karolherbst: thanks
21:00koike: sima robclark daniels airlied o/ I wanted to check with you about drm ci workflow, iiuc patches should land on drm-misc/drm-misc-next and maintainers of trees like msm should cherry-pick them? Or merge drm-misc-next into their tree?
21:02robclark: koike: I think it would be ok for you to ack-by drm/ci patches to land thru driver tree in cases where they are only about one driver and not going to conflict with other drm/ci patches that are in flight. For changes that impact multiple drivers, like uprev'ing igt, that should probably go via some more central tree
21:03robclark: patches landing via driver tree should have a-b from you and if you are watching other drm/ci patches you should have an idea if there is potential for merge conflicts, etc
21:03robclark: that's my $0.02 anyways ;-)
21:05koike: robclark: if the patch enables a new device I wonder if it wouldn't be nice to get it in a central place, so other patches doing core changes could validate on that device too
21:07robclark: hmm, I guess that depends on how often you expect drivers to break themselves vs core change to break them.. IME it is more often the former..
21:07robclark: but in this case, the new job is manual for now
21:07robclark: until we get some confidence in the stability of the new board farm
21:14koike: robclark: right, so I guess we can try this way
21:14airlied: koike: yeah I think new ci files and core stuff via the misc tree, and backmerge to driver tree if necessary
21:14airlied: and updates for fixes on single family should be in the driver tree
21:19koike: ok, thanks for clarifying :)
21:26karolherbst: dcbaker: I just ran into a bug writing a test: apparently you can't use the same header file for multiple rust.bindgen targets
21:27dcbaker: uhhhh, yeah. You say that and I immediately see why that would be
21:30karolherbst: anyway. pushed some changes + the test
21:33karolherbst: yeah.. you probably want to use the output file(s) as source for that target name or something
21:55dcbaker: karolherbst: ugggh, that mypy thing is real... I swear I've fixed that before
22:49soreau: zmike: this seems to work for vblank_mode and eglSwapInterval(); http://ix.io/4GLf
22:50soreau: shall I make an MR out of it?