18:20Company: so, I've been testing llvmpipe
18:20Company: and apparently performance in my benchmark went from 30 or so fps previously to 500fps
18:20Company: which confused me
18:21Company: and it turns out my gpu is under 100% load with that benchmark
18:21Company: why does llvmpipe use my gpu?
20:00Lynne: haasn: karolherbst brought up the topic of hybrid decoding here https://chaos.social/@karolherbst/112621561197599333 and I linked your dav1d fork
20:01Lynne: what were the latest performance/power figures on discrete/mobile again?
20:30haasn: Lynne: on discrete with uber gpu it was quite significant, like 50% iirc
20:30haasn: On mobile basically the same
20:30haasn: As cpu
20:31haasn: The shaders are hard to optimize
20:31haasn: I wanted to take another stab at it
20:32haasn: But dealing with u8vec4 etc was nontrivial for reasons I don’t remember
20:33haasn: And you have extra losses from memcpy due to the gpu’s complete inability to schedule threads
20:34Lynne: aren't mobile GPUs scalar, so vectors would make no difference?
20:35Lynne: IIRC modern GPUs abandoned dedicated vectors a-la-SIMD 10 years or so ago
20:36karolherbst: yeah, you don't want to vectorize inside your GPU code
20:37karolherbst: if you do: reconsider, because not doing that is really your only option. There is some benefit to aligning data so that single threads can do vector loads/stores, but that's pretty much the only benefit you can get from vectorizing (well.. except there are GPU ISAs with vec2 fp16 ops)
20:40karolherbst: Lynne: anyway, thanks for sharing your experience. I got a lot of people saying what "in theory" is the situation, so.. kinda glad that some people actually worked on it and know what's up
20:40karolherbst: Lynne: but yeah.. I'm more concerned about AV1 atm, because a lot of GPUs don't have it, and "you need to buy a GPU from 3 years ago" isn't really a nice answer either
20:43Lynne: dav1d is extremely optimized, if that's any consolation
20:43airlied: Company: is your desktop running on the gpu?
20:44karolherbst: Lynne: right.. I just wonder if it's optimized enough that your laptop battery gets through meetings :D
20:44karolherbst: but yeah...
20:44karolherbst: if it's quick enough then that's totally fine
20:44karolherbst: well not quick
20:44karolherbst: but like optimized
20:44karolherbst: or rather efficient
20:45karolherbst: I'm not caring about an "fps" metric here, I'm caring about "how long does a battery last while watching AV1 videos/streams" metrics
20:45airlied: haasn: what mobile were you targetting?
20:45airlied: would be interesting to see how it goes on intel hw pre-av1
20:46Lynne: karolherbst: meetings -> webrtc, which means there's a browser involved, which means there are lower branches you can pick off
20:47Lynne: firefox for example does like 7 memcpys of video data between decoding and actually presenting the data on screen IIRC
20:47karolherbst: oof
20:48karolherbst: but the API situation is also kinda a mess
20:49Lynne: as in decoding APIs? doesn't really matter when they download the data to feed it through their generic software codepath for color conversion and scaling
20:49karolherbst: right
20:51karolherbst: the entire situation sounds to me like we need some people really looking at it from an end-to-end perspective to make it all efficient. In a perfect world you feed the video stream into some lib and then it displays it into a surface you choose, bonus points if you can provide a custom shader/whatever for filtering/color conversion/scaling/etc...
20:51karolherbst: though I think that's mostly what ffmpeg is
20:51karolherbst: but then you need a different code path for hw acceleration, because...
20:54karolherbst: maybe we should just consider doing efficient fallbacks with vulkan video..
20:54karolherbst: but I already hear people saying "no"
20:54airlied: you'd essentially be porting chunks of dav1d to llvmpipe
20:55airlied: or an llvmpipe equivalent
20:55Lynne: ffmpeg lets you switch between hardware and software decoding on a per-frame basis, and it only takes tens of lines of code to enable whichever decode api you want to use
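The per-frame fallback pattern Lynne describes can be sketched roughly like this; `decode_hw`/`decode_sw` are hypothetical stand-ins, not the real ffmpeg API, which works through its `get_format` negotiation instead:

```python
# Hypothetical sketch of per-frame hardware -> software decode fallback.
# decode_hw() and decode_sw() stand in for the real decoder entry points.
def decode_frame(packet, decode_hw, decode_sw):
    try:
        return decode_hw(packet)       # try the hardware path first
    except RuntimeError:
        return decode_sw(packet)       # fall back to software for this frame

# Usage with stub decoders: the hw path fails, the sw path succeeds.
def hw(pkt):
    raise RuntimeError("no hw session")

def sw(pkt):
    return f"sw-decoded:{pkt}"

print(decode_frame("frame0", hw, sw))  # falls back to software
```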
20:55Lynne: but the issue is... patents
20:55karolherbst: I'd rather have people try it out and say "yeah well.. it's not that much faster, but you save a bit of power" before just thinking it's a bad idea
20:56karolherbst: Lynne: right... that as well, but it never felt like e.g. chromium uses ffmpeg in a way that it could be hardware accelerated..
20:56karolherbst: and if they enable vaapi support somehow, it sometimes just doesn't work
20:57Lynne: in firefox's case, they have an internal ffmpeg fork with every bit of patented code stripped out, and they only use it for vorbis IIRC
20:57karolherbst: heh
20:57karolherbst: no wonder some distros just replace that with the one they provide, so at least it would pick stuff up if it's a proper version
20:57Lynne: decoding in ffmpeg still requires the actual decoder code, even if going through a hardware decoder, in order to allow software decoding fallback
20:57karolherbst: right..
20:58Lynne: and openh264 requires browsers to ship the actual binary that cisco releases to decode without needing a patent
20:58karolherbst: yeah.. but like h.264 is slowly getting a thing of the past, and CPU decoding ain't that expensive either
20:59Lynne: right, I forgot about that, I have to actually remind myself that this era has come
20:59karolherbst: heh
21:00karolherbst: anyway, I'm more concerned about VP8/VP9/AV1 here and that most GPUs don't really accelerate that and users just being left with a terrible user experience from time to time
21:01karolherbst: like.. I know this is entirely my fault, but screencasting on my desktop is just a big no
21:01karolherbst: and I _think_ the future is to let pipewire handle all the details and pick up the proper hw acceleration
21:01karolherbst: or gstreamer rather
21:04airlied: I think one reason doing encode totally on the GPU with shaders might win is not having to readback the desktop content from uncached vram
21:04airlied: it might be less of a win on mobile though for the reason of not having pci bus
21:05karolherbst: yeah... I guess there is actually value in that, as long as the application doesn't ever let the CPU see the content and just straight push it to GL/VK/whatever to display it
21:05Lynne: I think pipewire handling video is a lost cause
21:05karolherbst: right..
21:05karolherbst: Lynne: how so?
21:06airlied: karolherbst: well for encode you want to read back the encoded data and send it somewhere else usually
21:06karolherbst: though I guess with pipewire/gstreamer you still get the result into some CPU side buffer, no?
21:06karolherbst: ohh right, encode
21:06Lynne: capture path in pipewire for clients is a stygian nightmare, requires you to write 2000 lines of libdbus code just to know whether you can capture a cursor or not
21:07karolherbst: though the encode thing most users are hitting is reading from your webcam
21:07karolherbst: (and screen sharing)
21:08Lynne: modifiers for pipewire screen capture were added as an afterthought and require negotiation
21:09Lynne: a dedicated capture protocol for wayland is the way to go
21:09karolherbst: probably
21:09Lynne: gnome won't implement it, that's a guarantee
21:10karolherbst: I guess it depends on how big the benefits are
21:10airlied: wouldn't that just recreate the X11 problem of any app can capture your whole display?
21:11karolherbst: negotiation is kinda a pain here, but at some point you'll need to import a buffer/stream/whatever to do something with it, and in a perfect world it's a handle to a GPU buffer
21:11karolherbst: well.. wayland compositors _could_ implement access control on a wayland protocol thing, but that's just a different discussion altogether, and I'd rather not get involved
21:11Lynne: there are implementations of that already ^^
21:12karolherbst: I think the bigger question is if that solution is suitable for user cases like flatpak or not
21:13karolherbst: but anyway, I'm more concerned about where to go from "I use this webrtc based online meeting thing and it requires me to encode AV1 of my 4K desktop and send it to others"
21:13karolherbst: and how we can make it not suck once the world moves to "AV1 or nothing"
21:14haasn: airlied: pixel 7
21:14karolherbst: but we are already at the same situation with VP8/VP9
21:15karolherbst: soo.. yeah
21:15karolherbst: I literally can't use the screencast feature of my desktop, because it's just unusable due to the entire situation
21:16karolherbst: it's my fault for having a dual 4K setup, but using h.264 doesn't suck as much, and if we do "force" the use of VP8/VP9/AV1 to users, it shouldn't be worse than just using h.264
21:18Lynne: vp8 in particular should be practically free to decode anywhere
21:18Lynne: it's a really simple codec
21:18karolherbst: yeah, but not to encode
21:18karolherbst: well...
21:18karolherbst: at least not to encode a dual 4K thing :D
21:18Lynne: yeah, to encode, libvpx is beyond awful
21:19Lynne: I think there was either a profile limit, a codec limit or a libvpx limit that forbade encoding with over 1500kbps bandwidth
21:20Lynne: so you couldn't really do 4k with that little, the signalling overhead (no data) from all blocks would likely cost you most of that
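As a rough illustration of the signalling-overhead argument: the superblock size is VP9's, but the per-block header cost and frame rate are assumed round numbers, not codec-accurate figures. Even a few bits per block eats a noticeable slice of a 1500 kbps budget, and real per-block costs are higher:

```python
# Rough sketch: per-block signalling overhead at 4K (illustrative numbers only).
import math

width, height = 3840, 2160
superblock = 64                      # VP9 superblock size in pixels
fps = 30                             # assumed frame rate
header_bits_per_block = 4            # assumed average cost to signal a skipped block

blocks = math.ceil(width / superblock) * math.ceil(height / superblock)
overhead_bps = blocks * header_bits_per_block * fps

print(f"{blocks} superblocks/frame, ~{overhead_bps / 1000:.0f} kbps of pure signalling")
```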
21:21karolherbst: though I think they might have improved the situation now...
21:22karolherbst: well...
21:22karolherbst: still uses 700% CPU on my i7-10850H
21:24karolherbst: oh right.. it was unusable on my macbook
21:25karolherbst: but the GPU is also at 100%, so it's kinda tough to record anything substantial
21:28Lynne: oh, right, libvpx-vp9 had a dedicated realtime mode that might not get used
21:28Lynne: IIRC I got hundreds of fps at 4k with it
21:29karolherbst: ohh right, somebody mentioned that in the past..
21:29karolherbst: I really should dig deeper at some point
21:33Lynne: I get 75fps at 4k with -deadline realtime -cpu-used 8
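For reference, the libvpx-vp9 realtime options Lynne quotes map to an ffmpeg invocation roughly like this sketch; the input/output names and the bitrate are placeholders:

```python
# Sketch of an ffmpeg command line using the libvpx-vp9 realtime options
# quoted above; "input.y4m"/"out.webm" and the bitrate are placeholders.
import subprocess

cmd = [
    "ffmpeg", "-i", "input.y4m",
    "-c:v", "libvpx-vp9",
    "-deadline", "realtime",   # realtime deadline mode
    "-cpu-used", "8",          # fastest speed preset in realtime mode
    "-b:v", "1500k",
    "out.webm",
]
# subprocess.run(cmd, check=True)  # uncomment to actually encode
print(" ".join(cmd))
```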
21:33karolherbst: try the same on aarch64
21:33Company: airlied: yes
21:34Company: airlied: it was just a casual "lemme check how this impacts software rendering" with MESA_VK_DEVICE_SELECT
21:34Company: like I used to use LIBGL_ALWAYS_SOFTWARE
21:34Company: which still works as I expect
21:37airlied: Company: so if you are sw rendering something and your compositor is recompositing it'll use the gpu
21:39Company: you mean mutter is using the gpu?
21:40Company: that's not where the 100% gpu load is from though - according to intel_gpu_top it's all my software rendering
21:43Company: i'm an idiot
21:44Company: airlied: ignore me - mclasen recently added code that skips the Vulkan renderer if it's software, so if I force software rendering that code kicks in
21:44Company: so asahi gets its GL renderer and doesn't use llvmpipe
22:00DemiMarie: karolherbst: is aarch64 slower?
22:01DemiMarie: Speaking of screen capture, how bad is it to have to copy every frame from a guest from GPU memory to CPU memory?
22:08karolherbst: on 4K that's like at least 2 GiB/s
22:09karolherbst: and pcie 3.0 x16 is like 16 GiB/s
22:09karolherbst: but of course there is more stuff going on
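karolherbst's numbers hold up as a back-of-envelope: one 4K screen at an assumed RGBA8 and 60 fps comes to about 2 GB/s of readback, against a PCIe 3.0 x16 theoretical peak of roughly 15.75 GB/s. A quick sketch, with frame format and rate as assumptions:

```python
# Back-of-envelope: readback bandwidth for one 4K screen (assumed RGBA8, 60 fps).
width, height = 3840, 2160
bytes_per_pixel = 4                  # RGBA8
fps = 60

frame_bytes = width * height * bytes_per_pixel
readback_rate = frame_bytes * fps    # bytes per second

pcie3_x16 = 16 * 985e6               # ~15.75 GB/s theoretical peak (985 MB/s/lane)

print(f"readback: {readback_rate / 1e9:.2f} GB/s")
print(f"fraction of PCIe 3.0 x16: {readback_rate / pcie3_x16:.1%}")
```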
22:10karolherbst: but on aarch64 the problem is rather that I think the encoders aren't really optimized for aarch64, so it's slow
22:11HdkR: Just think of an encoder that is expecting to take advantage of single-cycle vector operations. Then you move to ARM and they take three cycles and you have half the pipelines :P
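HdkR's point as arithmetic; the cycle counts and pipe counts are illustrative, not measurements of any specific core, and the model assumes a dependency chain that can't hide latency:

```python
# Illustrative worst-case throughput model: 1-cycle ops on 4 pipes
# vs 3-cycle ops on 2 pipes, with no latency hiding (assumed numbers).
x86_pipes, x86_latency = 4, 1
arm_pipes, arm_latency = 2, 3

x86_tp = x86_pipes / x86_latency     # ops per cycle
arm_tp = arm_pipes / arm_latency

print(f"worst-case slowdown: {x86_tp / arm_tp:.0f}x")
```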
22:14karolherbst: heh
22:14karolherbst: I should check if it got any better though