IRC Logs of #wayland on irc.freenode.net for 2023-07-17

04:11 kchibisov: emersion: you could create a surface of size 1x1, then you resize to 800x600 with scale 2, then you swap buffers -> you crashed, because you've commited 1x1.
04:13 kchibisov: Or, you have 2 windows rendering at your application, each window has the same context, you get a resize with the scale of 2, you call eglMakeCurrent, you resize, you apply scale -> you've crashed, because you've commited the old size with scale 2.
04:13 kchibisov: Because eglMakeCurrent latched the buffer.
04:15 kchibisov: So unless you study mesa's EGL code and bugs you'll have a very good chance to get crashes.
06:38 emersion: kchibisov: I don't understand what this buffer latching is about
06:47 kchibisov: emersion: https://gitlab.freedesktop.org/mesa/mesa/-/issues/6818
06:48 kchibisov: I think this issue has all others linked.
06:49 kchibisov: As well as https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18984
06:49 kchibisov: This is all relevant for egl, not for vulkan iirc.
06:50 wlb: weston/11.0: Michael Tretter * backend-drm: schedule connector disable for detached head https://gitlab.freedesktop.org/wayland/weston/commit/a5d52075a07b libweston/backend-drm/ drm-internal.h drm.c kms.c
06:50 wlb: weston Merge request !1319 merged \o/ (Schedule connector disable for detached head for weston 11.0 https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/1319)
06:56 kchibisov: Though, the toolkits could workaround that by setting viewporter dst_size to the one they've resized. So when everything matches compositors will ignore, but if you got a missmatch due to EGL stuff it'll 'amortise' the issue. Though, I'd rather have mesa fixed wrt that context stuff.
06:59 emersion: i still don't understand why locking/unlocking buffers has any impact on what happens on the wire Wayland-wise
07:00 kchibisov: The user might think that they've resized the buffer to account for the new scale.
07:00 kchibisov: But in reality they don't.
07:00 kchibisov: However the scale will be applied.
07:01 emersion: so, makecurrent, resize, set_scale, swapbuffers, and you got a wrong size?
07:01 kchibisov: So if you do (your surface was 501x401) eglMakeContextCurrent(), wl_egl_resize_surface(800x600), wl_surface.set_buffer_scale(2), eglSwapBuffers() -> crash, 501 not dividable by 2.
07:01 kchibisov: yes.
07:01 emersion: why do you resize while the context is current?
07:01 emersion: it sounds purely like a client design issue
07:02 kchibisov: If you have 2 windows at the same time one could assume that to start working with the surface you need to make it current.
07:03 kchibisov: That's what Qt for example did.
07:03 kchibisov: And that's what I've assumed until I've read all the mesa's wayland egl code.
07:04 kchibisov: And you can clearly resize once the context is current, you just can't call the function to make it current.
07:05 kchibisov: One other case, is if you don't do eglMakeCurrent, but do wl_egl_create_window.
07:05 kchibisov: Because it'll do the same to the actively current context.
07:06 kchibisov: Hm, maybe it was for eglCreateContext actually, so if you have a current context and you create a new one, it'll latch the current one.
07:07 kchibisov: One could make context current to compile shaders for example, then make it not current, but the buffer will still be latched, because you've called eglMakeCurrent.
07:16 MrCooper: "latched" as in "attached and committed"?
07:17 kchibisov: latched as in resizing won't apply until the eglSwapBuffers is called.
07:17 kchibisov: So your resize will go into the next frame.
07:18 kchibisov: But your scale can go to this frame.
07:18 kchibisov: Because scale is set by client, but resize is done by mesa.
07:19 wlb: weston Issue #778 opened by Hyomin Kim (hyoputer.kim) The mouse button release signal is not called after moving the surface https://gitlab.freedesktop.org/wayland/weston/-/issues/778
07:20 MrCooper: so the client must only change buffer scale if it also calls eglSwapBuffers?
07:20 kchibisov: They must change the buffer scale before the buffer gets latched.
07:21 kchibisov: So before the operations listed in egl spec (buffer age, etc), eglMakuCurrent(probably a bug), eglCreateContext(if there's any other current context on the calling thread, probably also a bug).
07:22 kchibisov: The client though, could ensure that their size actually applied by querying the size.
07:22 kchibisov: So they could try to resize, check if they resized, and if the sizes matches they could set buffer scale.
07:27 kchibisov: Maybe I should stop caring about that and tell every 'my egl application crashes, your toolkit is bugged' bug report that it's not my issue and close them.
07:28 kchibisov: Once fractional scaling will be more adopted it won't be a thing anyway.
07:28 emersion: it is a client bug though
07:28 emersion: and i will not stop crashing your client
07:29 kchibisov: My client is fine.
07:29 kchibisov: Because I know how to write and when it'll latch.
07:30 kchibisov: But I still believe that eglCreateContext must not latch.
07:30 kchibisov: Because it doesn't even make any sense for it to latch other arbitrary context.
08:45 emersion: jadahl: would it be possible to get this merged? https://github.com/flatpak/flatpak/pull/4920
09:08 pq: _DOOM_ should not use libwayland-server as a generic utility library because if those utilities are lacking in some way, enhancements will probably not be accepted. That goes especially with the event loop stuff, and the other stuff is likely frozen by ABI anyway. So it's a risk having to use something else in the future anyway as your requirements grow.
09:11 jadahl: emersion: i'm not a flatpak maintainer, I can only try to nag again
09:17 pq: kchibisov, you have to be really *really* careful with EGL in order to be sure of the buffer size after resize is what you expect. Once EGL or GL internally needs a buffer, it's size is locked until the next swapbuffers. Use wl_egl_window API to query it, I think? Or always wl_egl_window_resize() immediately after eglSwapBuffers and live with that.
09:18 kchibisov: pq: I know, I'm just afraid that agressive 'kill the client logic' could make more harm with egl.
09:19 kchibisov: Like you update you kwin, and then none of the Qt apps launch anymore.
09:19 kchibisov: Because they have broken EGL handling.
09:19 kchibisov: Which worked for them on anything other than Wayland.
09:20 kchibisov: pq: my issue though, is not gl operations, but egl operatons which don't do any rendering or affect the buffer.
09:22 kchibisov: At least in the past daniels said that it's a 'mesa issue' here (https://gitlab.freedesktop.org/mesa/mesa/-/issues/6547 , last comment).
09:24 pq: right, you seem to already know everything there is about this
09:25 kchibisov: If we start saying that it's not a mesa issue, then we must update the spec one more time.
09:26 pq: Unfortunately there is no other way to inform clients they are doing glitchy things than outright protocol error.
09:26 kchibisov: I mean, it's fine to do a protocol error.
09:26 pq: It is a Mesa issue if EGL API calls lock the back buffer with no reason to.
09:27 kchibisov: it's just maybe not a right time for due to known bugs in drivers.
09:27 pq: But EGL API calls that actually do need a locked buffer, it's not a bug.
09:28 kchibisov: pq: I agree, and I'm saying that we have a situtation with 1)(mesa latches where it shouldn't) which affects Qt.
09:28 pq: Qt? You were not talking about winit or glutin?
09:29 kchibisov: I was talking about the situation we have in general.
09:29 kchibisov: My code is resistant to that issue, because I'm well aware how it works.
09:29 pq: ok
09:30 kchibisov: But if you look into https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18984
09:30 kchibisov: You'll see Qt bug being linked.
09:30 kchibisov: While it's not an issue right now, because you can't possibly crash due to that with the scaling from the wl_output.enter, once the wl_surface::preferred_buffer_scale will be used it'll start crashing in addition.
09:32 kchibisov: Simply because you render the first frame with the scale of 1 and scaling of 1 + broken size is not a real issue other than a 'glitch', but with the new event, it'll be delivered to you along the first configure, you'll apply that scaling and crash if the latched buffer was not dividable.
09:32 pq: No, you definitely should crash regardless of where you got your scale factor from.
09:32 kchibisov: I mean, that with legacy you render first frame at scale of 1.
09:33 pq: The protocol error is that the client is internally inconsistent: it promises the buffer size + viewporter results in integer surface size, and it doesn't.
09:34 kchibisov: I understand, but I'm saying that client doesn't really control the buffer size with mesa.
09:34 pq: oh, the initial scale 1
09:34 pq: client does control the buffer size with Mesa EGL, but it very hard to get it right.
09:35 kchibisov: You have a broken result even when you 1) Create a EGL surface of size 800x600 2) and the next operation you do is an instant resize to 900x700.
09:35 kchibisov: So whatever you've passed initially will be used.
09:35 pq: yeah, you'd better use the right size from the beginning.
09:35 kchibisov: Right, but it's sooo common, because folks don't want to have a Nullable type.
09:36 pq: It's really wasteful to create one size and then immediately resize.
09:36 kchibisov: The issue I'd at least fix is the eglCreateContext/Surface latching other context.
09:36 pq: I don't understand how a Nullable type relates here.
09:37 emersion: pq, if you create an EGLSurface with arbitrary size at init time, your surface is guaranteed to never be null
09:37 emersion: but yeah it's not great
09:37 pq: yes, I'd those are bugs
09:37 emersion: allocating large slices of GPU memory is slow
09:38 kchibisov: I've seen a lot of folks doing a 1x1 wl_egl_window stuff.
09:38 kchibisov: And then they resize to the right thing on new configure.
09:39 kchibisov: I've fixed myself 3-4 other folks clients due to that...
09:39 kchibisov: And there's still a Qt issue due to exact same reason...
09:40 pq: What do you want to happen? Compositors stop sending the protocol error for a while?
09:40 kchibisov: https://bugreports.qt.io/browse/QTBUG-106273
09:41 kchibisov: I would at least don't send them for dmabuf for a while.
09:41 pq: Compositors certainly can stop sending protocol errors if they want to.
09:41 kchibisov: If it was an error from wl_shm they must kill the client.
09:42 pq: but if compositors stop sending the error, then the problem disappears. Why would any client side get fixed then?
09:42 emersion: that sounds like a bad workaround
09:42 pq: can people actually see the glitch?
09:42 kchibisov: I can.
09:42 pq: oooh, that gives me an idea
09:42 kchibisov: That's the reason I'm aware of all of that, because I was fixing the glitch.
09:42 emersion: EGL can do shm, too
09:42 kchibisov: Hm, right...
09:43 kchibisov: And it'll have the exact same logic as dmabuf path.
09:43 pq: instead of a protocol error, a compositor could use placeholder "this window is bugged" graphics on the wl_surface until the size requirement is respected again.
09:44 pq: make the glitch much more severe while not killing the app
09:44 kchibisov: Just draw a bright pink window, it's really annoying.
09:45 pq: yeah, or if you want it fancy, have a text in it saying to file a bug or something - so if it doesn't go away immediately and the window becomes unusable, the user gets a clue.
09:46 kchibisov: Though, the thing is that once fractional scaling is in use, this glitch sort of goes away.
09:46 pq: why?
09:46 kchibisov: you'll have a wrong scaled client, but you can't kill it anymore.
09:46 pq: huh? why?
09:46 kchibisov: Because the cliest asked to be scaled to dst via viewporter.
09:46 pq: hwre does the requirement of integer wl_surface size disappear?
09:47 pq: doesn't fractional scaling depend on viewporter, which still guarantees integer wl_surface size?
09:47 kchibisov: I mean, it's all integer, you just can submit a buffer of wrong size.
09:48 kchibisov: And all the buffers are scale 1.
09:48 pq: oh course, but that's not a protocol error
09:48 kchibisov: exactly.
09:48 kchibisov: So the issue won't get fixed, but masked.
09:48 pq: we're not after wrongly scaled clients, we're after protocol errors
09:48 kchibisov: yeah, that's true.
09:48 pq: wrongly scaled clients are client bugs, we compositors don't care
09:49 kchibisov: I'm just saying that the root cause might not get actually fixed.
09:49 pq: right
09:49 kchibisov: Like if all toolkits do fractional scaling and all compositors do fractional scaling, the real issues will be tolerated.
09:49 pq: it just joins set the of other client bugs that a compositor cannot detect or warn about - life as usual
09:50 kchibisov: And given that Qt and kwin can do fractional scaling, the issue is not that big of a deal for kwin.
09:50 pq: yup, unless people see the glitch, which the compositor now cannot make more visible either
09:51 kchibisov: The only compositor where you can really observe glitches is sway, because it's tiling.
09:52 pq: oh, I thought you were able to see the glitch on a floating window
09:53 kchibisov: I know that glitch is observable in gnome and sway.
09:53 kchibisov: And in gnome the way to observe it is to try start your window maximized.
09:53 kchibisov: If the buffer is not matching what gnome wants it won't make it maximized.
09:54 q234rty: tbh I'm running a patched wlroots with that particular protocol error removed since the hidpi Xwayland patch I'm using also triggers that
09:54 q234rty: https://gitlab.freedesktop.org/xorg/xserver/-/merge_requests/733#note_1738935
09:54 pq: right, subtle
09:55 kchibisov: Maybe making Qt apps crash for a while is a way to go, it's widely used so it could get attention to get it fixed.
10:03 kchibisov: Oh, if they pick initial dimensions like 480x240 they'll work around the bug.
10:04 kchibisov: (technically 120x120)
11:48 pq: emersion, thanks for noticing that Weston code with unmerged kernel UAPI.
13:25 wlb: weston Merge request !1320 opened by Philipp Zabel (pH5) doc: add workaround for doxygen 1.9.6 bug with cairo >= 1.17.6 https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/1320
13:26 wlb: weston Merge request !1321 opened by Philipp Zabel (pH5) libweston: Document struct weston_mode https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/1321
14:12 wlb: weston Merge request !1321 merged \o/ (libweston: Document struct weston_mode https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/1321)
14:12 wlb: weston/main: Philipp Zabel * libweston: Document struct weston_mode https://gitlab.freedesktop.org/wayland/weston/commit/40df321d7c50 include/libweston/libweston.h
14:17 wlb: weston Merge request !1320 merged \o/ (doc: add workaround for doxygen 1.9.6 bug with cairo >= 1.17.6 https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/1320)
14:17 wlb: weston/main: Philipp Zabel * doc: add workaround for doxygen 1.9.6 bug with cairo >= 1.17.6 https://gitlab.freedesktop.org/wayland/weston/commit/f8611607ecb5 doc/sphinx/meson.build
18:37 riteo: hi!
18:37 riteo: kennylevinsen: sorry for the delay, I have one right here
18:37 riteo: one sec...
18:38 riteo: https://0x0.st/Hjl6.log
18:38 riteo: there
18:40 riteo: the popup gets done, it gets a new buffer attached, gets destroyed, commits, syncs (I don't remember if I added it for good luck) aaand... Errno 32.
18:40 riteo: (the segfault is artificial)
18:42 riteo: uh by get destroyed I meant damaged, sorry
18:43 kennylevinsen: are you reading/flushing the display yourself? if this had been a dispatch, it should have shown the protocol error that caused the disconnect
18:44 riteo: I'm handling event handling myself yes
18:44 riteo: but I do get protocol errors
18:44 kennylevinsen: it could look like you flushed yourself, and exploded immediately as write failed despite there still being stuff to read
18:45 riteo: Actually I don't think I ever manually flushed
18:45 kennylevinsen: I don't see a protocol error here, just "error 32 while flushing the Wayland display"
18:45 riteo: sorry, I meant that I usually get protocol errors
18:45 kennylevinsen: (protocol error means that you received a message from the server telling you exactly what crime the client committed, followed by disconnection)
18:45 kennylevinsen: right
18:45 riteo: oh I wrote flushing in the error message thinking about it
18:45 riteo: lemme check
18:46 riteo: oops I indeed flush in the main polling thread lol
18:46 riteo: there's an error condition in there though
18:47 kennylevinsen: there are times it makes sense to flush, but you should not die on the failed *write* - wait for the failed dispatch/read
18:47 riteo: and as I said I sometimes get protocol errors, also I have other logic to read the error so I'm not sure what I'm doing wrong there
18:47 kennylevinsen: without the debug log *with* the protocol error, I cannot say what is done wrong
18:47 kennylevinsen: but the complete log should show the exact sequence of events causing the issue
18:48 riteo: oh wait, so I should keep dispatching even after a failed flush?
18:48 kennylevinsen: dispatch until failed dispatch
18:48 riteo: all right, let me change the code
18:49 kennylevinsen: if you stop after failed flush, you might still have stuff to read (such as the protocol error!)
18:49 riteo: ohhh I see now, thanks for letting me know
19:05 riteo: noooooo
19:05 riteo: > xdg_surface@71: error 3: xdg_surface has never been configured
19:06 riteo: it is indeed the dreaded issue I linked then
19:07 riteo: thanks a lot for helping me with debugging the issue btw!
19:08 riteo: sooo... I can only hope that mutter/kwin have some saner behaviour, otherwise we're in a bad situation
19:08 riteo: but first I'll make a new log
19:18 riteo: wait all of a sudden it looks like it doesn't dispatch anymore, it just spins...
19:18 kennylevinsen: Unless it's a compositor bug, you're always in a bad place when you have protocol errors - even if other compositor are more lenient... :)
19:18 kennylevinsen: but might be fixable
19:19 riteo: I just noticed that in the docs it recommends to call the dispatch method both if the queue is empty and _after_ canelling/reading the events: https://wayland.freedesktop.org/docs/html/apb.html#Client-classwl__display_1a40039c1169b153269a3dc0796a54ddb0
19:19 riteo: why?
19:20 riteo: sorry I meant "is _not_ empty"
19:24 kennylevinsen: it dispatches until there is nothing left to dispatch (first loop), then writes, waits for data, and dispatches regardless of whether poll failed
19:25 kennylevinsen: Dispatching events first ensure that you do not block waiting for new stuff to read when you already have old stuff to do (deadlock), and dispatching after is... Well because you just read stuff so you have work to do.
19:25 riteo: I put this into a while loop though
19:25 riteo: so it'll prepare read and dispatch right away
19:25 riteo: it's a dedicated thread
19:27 kennylevinsen: In that case you can just as well call wl_dispatch
19:27 riteo: twice?
19:27 kennylevinsen: *wl_display_dispatch
19:27 kennylevinsen: Instead of rolling your own loop
19:27 riteo: oh there's a reason
19:27 riteo: we need a mutex
19:28 riteo: the main thread can do some stuff so locking avoids changing/reading data while the events thread is still changing it
19:28 kennylevinsen: Either way, follow the suggested standard method unless you're certain you do not need it - and certain you want to debug thsy yourself ;)
19:28 kennylevinsen: *that you dumb phone
19:28 riteo:shrugs
19:28 riteo: I'll just implement it as the docs say
19:29 riteo: well, as I said I'll make a new log, send it here and... dunno, the issue is known
19:29 riteo: it's definitely a bug though as inert objects should just take requests as no-ops
19:29 riteo: that's just the nature of the asynchronous protocol
19:31 emersion: drakulix[m], d_ed[m]: do you want to meet sometime this week?
19:32 drakulix[m]: I was under the impression that there is a KDE conference this week and thus they couldn't make it.
19:32 drakulix[m]: And given the protocols we want to talk about, I was thinking next week might be a better time for the next w-p meeting.
19:33 emersion: oh right
19:33 emersion: sorry, i didn't remember
19:34 drakulix[m]: btw, do I remember correctly, that you wanted to post the minutes to the wiki?
19:38 riteo: there: https://0x0.st/HjUH.log
19:39 riteo: same thing but now we have a nice more reasonable debug log, nice :)
20:22 riteo: well, I'll experiment and see how's the situation going elsewhere. Thanks for everything, cya!