04:15 DavidHeidelberg: Been at a RISC-V conf., tried to run eglgears/glxgears on some of the RISC-V laptops presented there, got segfaults & other unhealthy output from Mesa (sometimes 22.x :/ so, not even updated)...
04:16 DavidHeidelberg: I'm not crying here, but thinking.. there is like ~ 6 months until next Debian starts entering freeze (so we bump the CI and we can cover risc-v).
04:17 DavidHeidelberg: eventually, we could at least drop in an Alpine container with risc-v, maybe into nightly runs... not sure how well it'll run on x86_64. We don't have any risc-v machines (for now, but some will be available in the near future)
04:17 DavidHeidelberg: I have 3 pieces at home, but it's so slow, the x86_64 emul would be 10x faster.
04:19 DavidHeidelberg: For the HW, what I saw is Imagination (proprietary) or AMD (PCI-e, working kinda well on SiFive). Except that, everything is softpipe or llvmpipe (thanks to ORCJIT :) those were the ones which worked)
04:31 airlied: DavidHeidelberg: with PCIe it's often the PCIe hw is just broken
04:31 airlied: SoC PCIE hw is rarely tested on the things GPUs want
04:31 airlied: good to know orcjit works :-)
04:32 DavidHeidelberg: surprisingly, the PCIe GPUs seemed to work well (haha, let's be honest, no AAA titles were running there, but OpenArena level games)
04:34 DavidHeidelberg:regrets he didn't take pictures, they usually had cool PC cases on display
04:38 HdkR: perhaps fortunately, I don't play with RISCV hardware with GPUs. ARM hardware has mostly broken PCIe fabric :P
04:39 Company: how fast is llvmpipe on those things?
04:44 airlied: no idea what sort of vector support they have
04:46 DemiMarie: airlied: Is SoC HW broken, or are GPU drivers the broken ones?
04:47 airlied: DemiMarie: usually the hw
04:47 DemiMarie: airlied: in what ways?
04:47 airlied: they don't implement PCIE conformantly
04:47 DemiMarie: In what way?
04:47 airlied: usually they don't enact snooping support
04:47 DavidHeidelberg: Company: Answer is usually sorted to two groups, on small boards: No; on boards with PCIe (and better cores), it's Yes, but you use normal GPU anyway, so it doesn't matter :D
04:48 DemiMarie: airlied: I thought that the DMA API did not guarantee snooping.
04:48 DavidHeidelberg: yup, usually weird workarounds are needed, also I heard that for example recent AMD GPUs have some adjustments to work on these boards
04:49 Company: I'm just curious because people build PoS systems with those low-powered non-gpu systems
04:49 airlied: what has the DMA API got to do with the hw not doing it?
04:49 Company: and I'm waiting for the time when software GL is fast enough on those things
04:49 airlied: DavidHeidelberg: often the AMD adjustments are just hacks that disable a path, but overall the hw is screwed
04:50 DemiMarie: Company: doubt it will happen, once the CPU is fast enough I suspect they will want more things and it's back to a dedicated SW renderer.
04:50 airlied: there are endless threads on dri-devel with various non-x86 cpus trying to disable codepaths
04:50 airlied: the Loongson folks being the most horrible
04:51 airlied: there was one Loongson that I think used an AMD integrated GPU on an x86 southbridge
04:51 airlied: or rather northbridge
04:51 HdkR: Most ARM hardware has heartburn for nGnRnE PCIe mappings
04:52 airlied: always gives me bad AGP flashbacks
04:52 DemiMarie: airlied: My understanding is that the DMA API does not guarantee snooping, so Linux drivers that assume it are buggy from a DMA API perspective (they need explicit cache flushes).
04:53 airlied: DemiMarie: we don't use the DMA API
04:53 HdkR: AGP PTSD, oh no
04:53 airlied: or at least we work around its lack of support for snooping, since GPU needs it
04:54 DemiMarie: Why do GPUs need snooping and not flushes?
04:54 airlied: hw designers gonna design hw :-)
04:55 DemiMarie: airlied: why can't one add SW flushes?
04:56 airlied: because we have userspace mappings
04:56 airlied: we don't just map stuff in the kernel
04:59 airlied: you also don't want to be throwing away your whole cache all the time
05:02 DemiMarie: I thought one could force writeback without invalidating. Are syscalls for cache maintenance too expensive?
05:03 airlied: why bother adding all that when the hw is meant to support it
05:03 DemiMarie: Obviously any mapping that is written from both sides would be busted, but that's racy anyway.
05:03 airlied: you have just a bunch of code that never gets tested
05:03 DemiMarie: airlied: which HW?
05:03 airlied: PCIE hw
05:03 DemiMarie: Are there any Arm SoCs that get this right?
05:03 DemiMarie: I presume POWER does
05:03 airlied: the new Ampere seems good
05:04 DemiMarie: That's a server board
05:04 airlied: plugging a 16x GPU into an SoC is often hard, but I've no idea, there is a lot of socs
05:05 airlied: I think jetson might have been good
05:05 airlied: not sure the plug a gpu into the rpi pcie 1x ever worked :-P
05:06 airlied: oh maybe the rpi5 works
05:08 introducelogics: So you will have 66+67+128 representing powers "2+4 +max" is the last aka. 2 in power of 31 that is in internal adders collision with "65+68+128" 1+8+max how would one want to get rid of collision? so 68 added to first and 69 to second, and answer sets adjust to the fact. They are unique now, that is the first family of solutions where you stretch the possible powers to another set
05:08 introducelogics: for example from 130to194, where collision happens 130+4 or 130+5 goes in, but howto solve this within encoder? The ddr has read-modify-write ok, but how it would know that collision happened in intermediate internal adder? is that what you think is the main puzzle? so is correction bank either write enable to 67+130 or 68+130, so the counter is pushed to a modify base stack, the
05:08 introducelogics: address calculator would use base differently depending on how the bases were written, are such pagetables or hw hacks very difficult? imo probably no, or what you would think? DMA is pushed with quite high end sanity in hw along with ddr controller. as you see it is meant for such offloads, so 66 get's discarded with 65 how? We skip 65 or 64 within IR of base 130 cause those are
05:08 introducelogics: unique (by means of WE signal 0 at their read location), and over one you add to the collision bank. 66 get's no candidate since no such address (at read base 0), 67 is on skip where as 68 is written to base130. so the invariant sums become 66+68+130+68 is akin of 66+68, and 67+66+0 is akin for it's own. Likely one correction/collision bank is enough but needs some thought yet. I
05:08 introducelogics: looked at alphametrics puzzles, but all this stuff needs long hours of practicing and note taking for me yet. PCI-sig is with expenses to be accessed, so i have seen some old specs with AGP myself too that ended up in the web.
05:16 DemiMarie: airlied: thanks for the explanation
05:35 HdkR: DemiMarie: NVIDIA ARM boards get PCIe correct, including Xavier and Orin and of course Thor next year. Plus their server Grace offering of course.
06:08 realmeninblack: This is more of a demonstration for alternate way of doing it to dull on dma and ddr capabilities and have another view at things. Exact real procedure has been already talked about , two rounds of encoding, and you will see that the after power base marking pair round robin collision banning it leaves only somewhere 1024*1024 combinations, and we leave an error margin by the same size
06:08 realmeninblack: just in case, it is very clear that this is going to work, however 1024*1024 is replaced with highest number of double rounds of encoding. I have been so busy that have not coded the loop yet to see which number is the biggest in that double round set. But yes my proposal should be possible. RPI also comes with low pricetag and lots of flexibility of oss, kinda expected that hw bugs
06:08 realmeninblack: creep in, and very expected for many other socs as well, prolly dma as well as DDR are stable however, but none of this touches the fact that you can do transform feedback on any stage of shader engine , since the core is either unified or fixed function, the last which works in command queue based io views only, first has shader stages which are shared in arch. PCIe bugs can however
06:08 realmeninblack: be such kind that other IOs on unified architectures are not accessible from shader engines at a bugfluke, though i understood that most bugs reflect the rates which are not lithography bugs but more like soc trace or lane shielding or whatever bugs on the motherboard. DMA and ddr arch is hugely well designed.
11:31 magicalnull: now it's a bit hurry or busy situation , in another words other things eat time. But i can see that first power can be well 1+33, second 2+34, third 3+35, and null is 32+0, that yields single collision if both operands are same but they have different pc bases. 1+3 and 2+2 collide as others alike, 6+6 5+7 etc. That is the hardware way also.
13:47 magicalnull: the pc indexes are pulled from the other end, 65+1-33+65+3-35+35+33=33+33+33+35+1+3=138 aka 33+33+72=138 where as 72+32+32=136 , first is 1+3 , second 2+2 , shifting one operand to 64 get's minimal 1 difference on collisions saves very little space if one wants. So that is kind of it from me, i've been spamming too hard, i am sorry.
16:11 cheako: Hey, I was doing "well" but not great at writing vklayers in rust. ash, where I'm getting the vulkan types from, is good at writing vulkan structs... but is bad at reading them and I've a lot of code for just doing that. If ppl are interested in writing vulkan ICDs in rust, we should share code.
17:06 digetx: could anyone from Intel please ack the last patch of https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30988 that updates CI flake expectation for Zink/ADL test, Marge refuses to apply this mesa-cache MR because Intel CI fails due to the flaky test
17:44 kisak: mattst88: fwiw, media-libs/mesa-24.1.7 USE="opencl" was FTBFS for me. After transitioning back to my irregular mesa ebuild to go back to 24.2.5, it's fine again.
17:46 kisak: I should have grabbed a build log snippet. Alas...
20:19 benjaminl: is there a reliable way to get the nir variable associated with a {load,store}_input intrinsic?
21:20 karolherbst: benjaminl: usually you do that before IO lowering was done on the shader, and after lowering all the relevant information should be part of the load/store instruction
21:35 benjaminl: karolherbst: thanks! figured out that there was already a pass to do exactly this for the property I was interested in :)
21:35 karolherbst: ahh, cool
21:39 alyssa: benjaminl: my usual advice is "don't"
21:39 benjaminl: curious about the design intention behind doing it that way instead of preserving the var association?
21:39 alyssa: use locations on lowered i/o instead if you can
22:26 DemiMarie: airlied: which Ampere? Ampere Altra has an erratum (PCIE_65) that makes it not work, with the workaround being to emulate unaligned accesses in the kernel.
22:27 HdkR: They were referring to "The new Ampere" So that would be the AmpereOne
22:27 HdkR: New is relative considering it's already over a year old :D
22:29 HdkR: Just need System76 to immediately replace their new system with a recent chip instead
22:36 DemiMarie: Also, do the Nvidia chips work with generic PCIe GPUs, or just with their own? The Nvidia driver has a workaround for the bug.
22:36 DemiMarie: HdkR: IMO Linux should just add an emulator for unaligned access faults and enable it by default on all Arm machines.
22:37 HdkR: DemiMarie: I have a Radeon Pro W7500 plugged in to my Jetson Orin
22:37 HdkR: Eh, maybe the unaligned handler for everything makes sense, but I'd prefer if ARM vendors just fix their broken hardware.
22:43 HdkR: Although I definitely don't recommend buying an Orin board. It's old and has bugged atomics
22:44 HdkR: Thor will be a significant upgrade :)
22:45 daniels: trapping and fixing unaligned access is … not great for performance
22:46 HdkR: Also hard to be entirely correct when crossing 16B and cacheline access granularies
22:46 HdkR: granularities*
22:53 DemiMarie: daniels: what about recompiling everything with `-mstrict-align`?
22:53 DemiMarie: I’d prefer for hardware to be fixed too, but in the absence of that then an unaligned access emulator is the best option I know of.
22:53 iive: DemiMarie, I think that's a different type of align
22:54 DemiMarie: iive: the idea is to prevent the compiler from generating unaligned accesses so they don't need to be trapped
22:54 iive: why is the compiler generating unaligned access at all on architecture that doesn't support it?
23:01 iive: I understand if there is a bug where pointer arithmetic leads to unaligned access. but stuff that is entirely controlled by the compiler...
23:08 iive: my bad, it's the same align. but isn't it supposed to be set by march or target by default?
23:10 iive: hum... arm arch doesn't even have the options. aarch64 does.
23:27 iive: apparently ampere is aarch64.
23:28 HdkR: Indeed it is
23:28 HdkR: Neoverse-N1 based cores in the Ampere Altra, custom design in the AmpereOne