00:39 daniels: mareko: ask ajax and MrCooper, but I think they only really need llvmpipe and spice
00:43 daniels: jenatali: thankyou!
00:44 jenatali: I'd been putting it off. I really hate building LLVM
00:45 daniels: me too buddy
00:59 jenatali: Apparently the Vulkan runtime no longer installs unattended with /S but now the SDK includes it?
01:54 zmike: tarceri: actually I assigned it to you to make sure it goes in, since it's blocking another MR from landing
02:15 mareko: MrCooper: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33211/diffs?commit_id=70398ff5140891899927590c46d27ef8c48c6898
02:45 jenatali: Ugh how do I see which LLVM module is needed but missing?
02:49 airlied: usually grep
07:45 MrCooper: mareko: technically I'm in a different team now (focus on mutter & Xwayland), as is ajax, so you rather need to ask airlied or José Exposito; AFAIR we do support amdgpu with acceleration on ppc64el in RHEL in principle though, so not having any CI coverage isn't great
08:45 sima: dakr, good mail, thanks for doing the wrestling
08:46 sima: also chatted with airlied and we're at 15+ years of dma-api maintainers randomly nacking stuff gpu drivers want/need
10:27 heuristicsman: what I do assume, though I do not have much experience yet with doubly compressed intrinsics and the like, is: + and - are compatible only if the subtrahend, before adding, is big enough yet always smaller than the value it is subtracted from (naturally the case), so in other words I do not think a single operand meets the requirement, so the data-bank selection scaffold needs to be decoded from
10:27 heuristicsman: the compiler's work, and the operand needs to be added to that bank and then encoded back into double encoding to yield the needed result, since the banks' answer-selection logic is big enough that it has both views, decode and encode, which are compatible. I think there are only minor rules like this, but the saving grace is that multibank data access can be done, with batched adds for operands, which should be
10:27 heuristicsman: compatible with, say, non-ill-formed output. Needs a bit of confirming; I vaguely recall those rules from past testing.
10:57 sima: DemiMarie, on the amd/virtio discussion, all these issues you point out is why I think there's either pup(FOLL_LONGTERM) or real hw support so that the iommu/gpu handles page faults/invalidations at the hw level
10:57 sima: and the mmu notifiers just pass tlb flush commands forward as needed
10:57 sima: anything else indeed just falls apart everywhere at the seams
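[Aside: a minimal kernel-C sketch of the pup(FOLL_LONGTERM) option sima describes, i.e. pinning the user pages for as long as the device mapping lives. Only pin_user_pages_fast(), unpin_user_pages() and the FOLL_* flags are real kernel APIs; the surrounding function and variable names are illustrative, and exact signatures vary between kernel versions.]

#include <linux/mm.h>
#include <linux/slab.h>

/* Illustrative long-term pin of a userptr range for device DMA. */
static int example_pin_userptr(unsigned long uaddr, unsigned long npages,
			       struct page ***pages_out)
{
	struct page **pages;
	int pinned;

	pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/*
	 * FOLL_LONGTERM tells the core mm this pin may last indefinitely,
	 * so it migrates the pages out of ZONE_MOVABLE/CMA before pinning.
	 */
	pinned = pin_user_pages_fast(uaddr, npages,
				     FOLL_WRITE | FOLL_LONGTERM, pages);
	if (pinned < 0) {
		kvfree(pages);
		return pinned;
	}
	if (pinned != npages) {
		/* Partial pin: drop what we got and report failure. */
		unpin_user_pages(pages, pinned);
		kvfree(pages);
		return -EFAULT;
	}

	*pages_out = pages;
	return 0;
}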
11:10 tangoentanglement: so maybe it is possible to pad operands to stay at the required size, i.e. so that it would not yield a bigger value from a smaller one and could pass dependencies down the line to later instructions, sort of like coalescing into a subtract. If that is not possible, then only constants can be compiled in and passed forward at compile time straight as deps to later instructions, but anyway,
11:10 tangoentanglement: taking the bank of selection scaffolds and decoding them is not that high an overhead either. I am testing all of that this February, but I think it did make sense, at least I vaguely remember so from calculator times.
13:00 jinglearoundstars: it very much looks like the transition values with padding would indeed work, so you pass startingfrommaxvalue_as_anyconstant+singlepackedvalue and consistently receive it as such from double-encoded scaffolds, so receiver slots would be decoded properly; it can be defined as an offset which otherwise would not be used. If only operands were decoded with the offset appended, I think they
13:00 jinglearoundstars: can be received fine in the double-encoded receive or arrival slots. But I cannot remember where those parts landed in the logs; the execution itself is easy, but there is still a number of lines of work to do, and as I said -- I am not interested in sharing that work anymore. That is very reactive or, say, invasive code to the world, but I assume many parties of rich people actually
13:00 jinglearoundstars: have it, to print money to back up their manufacturing losses. I mean, I am not able to compete without leveling up there; otherwise it very much looks like I would get eaten for breakfast. But if the value is double-encoded from a subtract, it needs no offset anymore, because the compiler already filled it in. I am in a war with many terrorists, however I am not likely in conflict with the successful
13:00 jinglearoundstars: people; due to those other morons bothering me with teams, I need to push better code of my own, which appears as a stereotypical side effect. People have asked why I need to go there; well, those are interesting questions. The answer is stereotypical targeted envy at me by leftovers, which you do not get, so it is rather strange to you why I work so hard.
13:12 jinglearoundstars: but there is no triple encoding anywhere in the compiler anymore (which simplifies things), only addressing, since triple encoding is a side effect of adding two double-encoded hashes addressed naturally.
13:13 jinglearoundstars: that is because I succeeded with data-bank access
13:13 jinglearoundstars: so triple encodings and above are no longer needed.
13:25 zmike: mareko: is LINEAR really not supported for RGBA32F formats?
13:28 zmike: cuz it seems to work...
14:26 sandiorboiko: maybe I did not explain it as sharply overall again, but you see, double encodings are simpler to do as address mappings imo; there is no need to do a full encoding if the form is already some bank that is only 3000 digits, so it can now be addressed from a table. But overall I am screwed, my timeline on the work is starting to get tight, grandmother wants to kick me out, and the wolves
14:26 sandiorboiko: are as much against me as they ever were. I do not know where they will let me live like a real human; what is happening is entirely hypocritical. I get over the line after months, but I do not have that time. It is a very difficult last effort that I am trying, and I need it to happen, but I cannot work at all in the environments they cheat me into.
14:56 DemiMarie: sima: In this case pup(FOLL_LONGTERM) is even more attractive because device memory is just virtual memory.
14:57 DemiMarie: sima: Can the forced migration to device memory be done reliably?
14:58 DemiMarie: Also, time to bypass the DMA API maintainers and send something directly to Linus?
15:00 phasta: You should think long-term. Would fixes and reworks then also be sent directly to him 3 years down the road?
15:02 sima: DemiMarie, I didn't really follow that part since it was about virtio specific things
15:03 sima: the kernel really can't, because if you do this like hmm you again need hw support for pagefaults
15:03 sima: plus hmm cannot guarantee migration to device memory
15:04 DemiMarie: sima: the idea I had is to move the pages to device memory and leave them there
15:05 sima: anon memory probably freaks out to no end if it's suddenly device memory without a struct page
15:05 DemiMarie: If you don't have HW support for pagefaults then it's up to the host kernel to fail the operation.
15:05 DemiMarie: What about device memory with a struct page?
15:05 sima: you could do it as coherent device memory, then anon memory works in your device memory (unlike device private memory that hmm uses)
15:06 sima: but you're again stuck on the core mm's inability to guarantee migration
15:06 sima: migration is all best effort
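[Aside: a small userspace C illustration of "migration is all best effort": move_pages(2) returns a per-page status array, and an individual page can fail to migrate (e.g. -EBUSY) even when the syscall itself succeeds. The node number and single-page setup are only for illustration; build with -lnuma.]

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *buf = aligned_alloc(page_size, page_size);
	void *pages[1] = { buf };
	int nodes[1] = { 0 };		/* ask for NUMA node 0 */
	int status[1] = { -1 };

	memset(buf, 0, page_size);	/* fault the page in first */

	if (move_pages(0 /* self */, 1, pages, nodes, status,
		       MPOL_MF_MOVE) < 0)
		perror("move_pages");

	/*
	 * status[0] is the node the page now lives on, or a negative errno
	 * (such as -EBUSY) if this particular page could not be migrated
	 * right now -- the caller has to retry or give up.
	 */
	printf("page status: %d\n", status[0]);
	free(buf);
	return 0;
}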
15:06 DemiMarie: stop_machine()? Only half joking.
15:06 sima: not enough
15:07 DemiMarie: Why can't migration be reliable?
15:07 sima: linux core mm does a lot of randomly grabbing a page/folio reference, and those all block migration
15:07 sima: with enough whacking it mostly works for stuff like cma or memory hotunplug with zone_moveable, but it's brittle
15:08 DemiMarie: What about make_device_exclusive_range() or similar, but without the exclusive part?
15:08 sima: pup(FOLL_LONGTERM) is one of the pieces to make it less brittle, so that you know whether an elevated refcount is temporary and more retrying should help
15:08 sima: or a permanent pin, and more retrying is only going to heat the world
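[Aside: a hedged sketch of the distinction sima is drawing: folio_maybe_dma_pinned() is the real hint that an elevated refcount comes from a FOLL_PIN/FOLL_LONGTERM user (so retrying migration only heats the world), whereas a plain transient reference may go away if you back off and retry. The retry loop and the example_try_migrate_one() helper are purely illustrative, not the actual migrate_pages() machinery.]

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/sched.h>

/* Hypothetical stand-in for one attempt of the real migration code. */
static int example_try_migrate_one(struct folio *folio);

static int example_migrate_with_retry(struct folio *folio, int max_retries)
{
	int i;

	for (i = 0; i < max_retries; i++) {
		/*
		 * A long-term pin can hold the folio for an unbounded time,
		 * so retrying is pointless -- fail fast instead.
		 */
		if (folio_maybe_dma_pinned(folio))
			return -EBUSY;

		if (example_try_migrate_one(folio) == 0)
			return 0;

		/* Probably a transient reference: back off and retry. */
		cond_resched();
	}
	return -EAGAIN;
}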
15:08 sima: DemiMarie, that doesn't move anything
15:09 heat: the world
15:10 sima: DemiMarie, I guess you could try with coherent device memory and just migrating really, really hard
15:10 sima: then you're at the same peril like cma or memory hotunplug
15:10 DemiMarie: sima: could there be a way to lock out anyone who tries to grab a reference?
15:10 sima: but for perf-critical stuff like hmm migration it's fundamentally fallible
15:11 sima: DemiMarie, disable all the cool features like transparent hugepages
15:11 sima: numa load balancing
15:11 sima: ksm
15:11 sima: writeback too iirc
15:11 sima: constantly more getting added
15:11 sima: defo direct i/o
15:11 DemiMarie: sima: I meant "grab a mutex so they block"
15:12 sima: no
15:12 sima: DemiMarie, https://chaos.social/@sima/113911739075079093
15:12 heat: in theory you could do that but you'd create "heating the world" on the opposite, refgrabbing direction
15:12 DemiMarie: Why is that?
15:13 sima: see link but tldr is the linux core mm is designed on the principle that quicksand is awesome
15:13 heat: because if there was a refcount lock-out you'd spin on folio_get
15:13 heat: because there isn't, you spin on page migration (or fail)
15:14 heat: it's way easier to fail page migration than failing a normal-ass refcount
15:14 sima: it's also that core mm is lockless to the max
15:14 DemiMarie: For performance reasons?
15:14 heat: yes
15:14 sima: so even if you hold a reference and the lock for something, it's really surprising how little guarantees that often gives you
15:15 sima: like the entire pte walking is just pure yolo, and it happens absolutely everywhere all the time
15:15 DemiMarie: Why does it not crash? RCU?
15:15 heat: hey it's not pure yolo it's homebred RCU
15:15 sima: some of the best people in the world banging their heads at it for decades
15:16 heat: gup_fast generally just disables interrupts and doesn't use RCU
15:16 sima: heat, oh yeah it's a work of art
15:16 heat: to free a page table you need to do a TLB shootdown thus IPI thus if your IRQs are disabled it's safe to traverse
15:16 heat: it is in effect homebred RCU
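[Aside: a conceptual kernel-C sketch of the trick heat describes. gup_fast-style walkers disable local interrupts while traversing the page tables: freeing a page table requires a TLB shootdown, which on the classic configuration is delivered as an IPI, and that IPI cannot land while our IRQs are off, so the tables cannot disappear under us. This only illustrates the idea; it is not the real lockless walker in mm/gup.c, which also has an RCU-based variant for architectures that flush TLBs without IPIs.]

#include <linux/irqflags.h>
#include <linux/mm.h>

static void example_lockless_walk(struct mm_struct *mm, unsigned long addr)
{
	unsigned long flags;

	/* No IPIs -> no remote TLB shootdown -> page tables stay alive. */
	local_irq_save(flags);

	/*
	 * ... walk pgd -> p4d -> pud -> pmd -> pte here, loading each entry
	 * with READ_ONCE() and re-validating anything you rely on, because
	 * the entries themselves can still change underneath you ...
	 */

	local_irq_restore(flags);
}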
15:17 sima: there's also so much fun due to locking inversions
15:17 sima: where you lookup a thing, grab the locks and then recheck whether you got the right one
15:17 sima: and there's fundamentally no way to just take a lock to make things stable
15:17 sima: and it's getting worse every year, like with lockless vma traversals and page faults
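[Aside: a hedged sketch of the "look it up, grab the lock, then recheck" pattern sima mentions. Because the lookup runs without the lock, the object can be torn down or reused for a different key before we manage to lock it, so it has to be re-validated under the lock. The structure and function names here are hypothetical.]

#include <linux/spinlock.h>
#include <linux/types.h>

struct example_obj {
	unsigned long key;
	bool dead;
	spinlock_t lock;
};

/* Hypothetical lockless lookup, e.g. in an RCU-protected hash table. */
static struct example_obj *example_lookup(unsigned long key);

static struct example_obj *example_lookup_and_lock(unsigned long key)
{
	struct example_obj *obj;

again:
	obj = example_lookup(key);
	if (!obj)
		return NULL;

	spin_lock(&obj->lock);
	/*
	 * Recheck: the object may have been torn down or recycled for a
	 * different key between the lockless lookup and taking its lock.
	 */
	if (obj->dead || obj->key != key) {
		spin_unlock(&obj->lock);
		goto again;
	}
	return obj;	/* returned with obj->lock held */
}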
15:17 DemiMarie: I wonder at what point it would actually have been faster (dev time wise) to formally prove the whole thing correct and not have to do the debugging.
15:18 sima: DemiMarie, open random file in mm/ and stand back in awe at the if ladders
15:18 sima: especially anything handling pagetable entries
15:18 sima: but yeah formal proof probably good idea
15:19 sima: but the issue is also, what do you even want to prove
15:19 DemiMarie: "no memory corruption"
15:19 sima: because some things look very, very fishy from a "will it livelock" pov
15:19 sima: not even close to enough
15:19 DemiMarie: no deadlocks, no livelocks, etc
15:19 sima: the livelocks are real pain
15:20 sima: and often stochastic stuff
15:20 sima: like the race windows align such that you win often enough to never pile up, but if you'd have consistently bad luck you'd pile up
15:20 DemiMarie: wonders if past a certain point people should just be using multiple machines, rather than trying to make mm scale to huge machines
15:20 sima: yes
15:20 sima: cloud didn't happen just for fun
15:20 heat: this is not just about making mm scale to huge machines
15:21 heat: small machines are also heavily impacted
15:21 heat: big locks suck
15:21 sima: yeah small CrOS devices tend to really thrash mm
15:21 heat: the per-vma locking patches address problems <checks notes> in android when apps create like 80 threads at startup
15:21 DemiMarie: Big locks suck unless you care about reliability and security way more than performance. I suspect that is why OpenBSD is so full of them.
15:22 heat: OpenBSD is full of them because it's a hobby kernel
15:22 sima: yup
15:22 heat: they would like to get rid of them and are slowly doing so
15:22 sima: that too
15:22 sima: like I think core mm is probably one place where rust won't help
15:23 sima: like some of the memory barrier comments in there are just pure nightmare fodder
15:23 DemiMarie: ATS might, though. That's full dependent & linear types.
15:23 sima: since it's not just about your cpu code, but also about stuff like how tlb fetches actually walk pagetables on your machine
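[Aside: a hedged illustration of why those barrier comments get hairy: the classic publish/consume pairing. The writer has to order "initialise the data" before "publish the pointer", and the reader pairs with it; for page tables the "reader" can also be the CPU's hardware table walker, which is exactly the part that never shows up in the C code. Names here are illustrative.]

#include <linux/compiler.h>
#include <asm/barrier.h>

struct example_payload {
	int data;
};

static struct example_payload *example_slot;

static void example_publish(struct example_payload *p)
{
	p->data = 42;			/* initialise first ...              */
	smp_wmb();			/* ... then order init vs. publish   */
	WRITE_ONCE(example_slot, p);
}

static struct example_payload *example_consume(void)
{
	struct example_payload *p = READ_ONCE(example_slot);

	if (!p)
		return NULL;
	smp_rmb();			/* pairs with the smp_wmb() above    */
	return p;			/* p->data is now guaranteed valid   */
}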
15:24 heat: like, yes big locks make for simpler code, which is nice for security and reliability. but they also make you prone to suffer terrible choking on those huge locks, thus a reliability problem (and in effect, probably a security one, depending on what you're running)
15:25 sima: DemiMarie, I think more formal proving would be good, afaik only rcu in upstream linux is fully formally proved
15:26 DemiMarie: sima: I was thinking of extracting core mm from F* or Coq.
15:27 DemiMarie: heat: I think safety critical systems prefer to use multiple components that are individually single-threaded. They can scale by having many cores that don't share memory.
15:27 sima: DemiMarie, e.g. https://lore.kernel.org/dri-devel/887df26d-b8bb-48df-af2f-21b220ef22e6@redhat.com/ last paragraph
15:27 sima: device-exclusive was added, but not everywhere, boom in way too many places
15:28 DemiMarie: Honestly I think userptr is rather cursed.
15:31 DemiMarie: Can migration be reliable enough to make uAPI depend on it?
15:33 DemiMarie: I also wonder if this could be dealt with using hypervisor magic: "hey, that page of mine is a blob object now"
15:42 mareko: zmike: why wouldn't it be supported?
15:42 zmike: mareko: I have an MR to fix
17:46 neverthelessmaniac: how to explain this here: well, encoding to the compressed format deploys the encoder from the big value in the i-cache, and the result scaffolds a virtual cache from structures, the latter being the most overhead-heavy operation; the data loop isn't perfect either, but slimmer by, say, 3-fold perhaps? So it's cheaper in the compiler to embed state as the remainder of the whole buffer of banks which the compiler
17:46 neverthelessmaniac: lifted anyway already once, but we do not want to do that so often. Then you can say things like: I want the first bank, and remove all of the other banks of the trillion options in the register, and since decoding is done via a lookup table that ends up being faster. That is also a lot faster for IO. Now you write intrinsics saying you want to access some tiled set of answer banks; you
17:46 neverthelessmaniac: have the topmost, largest state, and when you remove the first state you get everything but the first, etc. This option became possible because of the simple fs hack I posted, which is not as heavy as encoding from the full initial value to packed. So now you can say that you want to bring together bankset1 and bankset2 and execute them, so deps would go from the first set to the second and
17:46 neverthelessmaniac: however else you need. So remember, decoding is cheaper than the initial encoding, so you want to go more this way for perfectionism on performance.
17:54 jenatali: Ugh. Meson 1.5.1 can't use CMake to find LLVM 19
17:54 jenatali: What a mess
17:55 daniels: jenatali: ...
17:56 jenatali: Means I need to rebuild the primary Windows container too to get a new meson apparently
17:58 daniels: twitches
17:58 daniels: that was a deeply unpleasant time of my life
17:58 daniels: the bit where I broke up with my long-term girlfriend was probably way less damaging than Windows + Meson + LLVM + CMake + CI
17:59 jenatali: Yeah... I got the build working locally with llvm19 so at least I'm pretty confident that just bumping meson should work
17:59 daniels: heading out now, fingers crossed for you tho :)
18:11 dj-death: daniels: and you do this for work...
18:49 jenatali: Aaaand new meson doesn't install without long paths enabled
18:50 jenatali: I hate dependency updates
18:57 mareko: wouldn't it be nice if LLVM wasn't required by Mesa
19:03 jenatali: Mhmm
19:04 jenatali: LLVM as a runtime dependency is terrible
19:07 kisak: mareko: hypothetically, how would you feel about delaying pulling llvm<18 support until after the mesa 25.0-branchpoint, in the hope that radeonsi/ACO is good to go for the newer AMD gfx generations by the time 25.1 rolls around? ~non-sequitur~ If the mesa build sees that llvm 15 is around but not usable with radeonsi/llvm, will it automatically build radeonsi/ACO, or will it fail the build with requirements not met
19:07 kisak: for radeonsi/llvm?
19:09 kisak: jenatali: llvm being too new for meson autodetect is a chronic issue. Over in Debian land, the build system adds in the equivalent of
19:09 kisak: export PATH:=/usr/lib/llvm-15/bin/:$(PATH)
19:10 jenatali: Yeah but Windows doesn't do llvm-config :(
19:10 kisak: well, that's dandy
19:11 jenatali: Fun, LLVM 19 requires /Zc:preprocessor for MSVC to be able to compile its headers
19:11 jenatali: Hopefully Mesa likes that too
19:15 jenatali: Looks like yes, phew
19:20 dcbaker: jenatali: we shouldn’t require long paths in meson. That sounds like a bug on our end
19:21 jenatali: dcbaker: It was a test that got run during chocolatey install that was too long
19:21 jenatali: I'll grab the log, one sec
19:22 alyssa: mareko: llvmpipe's existence makes that kind of a nonstarter..
19:23 jenatali: dcbaker: Ah pip, not choco. Log: https://gitlab.freedesktop.org/mesa/mesa/-/jobs/70278327#L391
19:24 jenatali: And I was wrong it's not meson it's numpy :(
19:24 jenatali: Oh it's meson's tests running as part of numpy's install. Gross
19:25 dcbaker: jenatali: of course it's cmake… and of course it's in numpy, which has a vendored copy of meson while we get some of their stuff upstream…
19:25 dcbaker: I wonder if I can ask the numpy folks to not run our tests on install
19:26 jenatali: Seems like the right call
19:30 dcbaker: Although that’s also an old version of numpy and numpy >=2.0 should work
19:35 mareko: kisak: I can delay that. LLVM isn't required by AMD drivers and ACO is used when LLVM is disabled at build time, but it's also not a tested or optimized configuration on RDNA 1-4. It's possible that when you enable llvmpipe, it also enables LLVM for radeonsi.
19:36 mareko: radeonsi+ACO likely won't be ready by 25.1
19:37 jenatali: dcbaker: There's an issue with some of Mesa's scripts that prevent it from working with >= 2.0
19:38 dcbaker: Sigh. I guess I only fixed piglit. I should probably fix that
19:38 jenatali: Oh maybe it was piglit, I don't remember. That same container gets used to build both
19:40 jenatali: Yeah https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29649#note_2493559 says it was piglit
19:40 jenatali: Should've checked if that constraint could be removed. Oh well
20:29 simplestofsuch: what intrinsics I meant: the address of a bankset, then the address of a bank, then the address of a cellset, then the address of a cell, which are all addresses of a programset. So four values worth 3000 digits each are enough; you do not need mul for this. Now you add those fields with accessor routines, so the first two banksets to be accessed are some index, and from those first two banksets you target three banks, then four
20:29 simplestofsuch: cellsets, then 9 cells. And what happens next: since they were double-encoded already, you reindex them according to the access and run the lookup table on them as you got them from a data bank, since it was a dependency tree, so that index yields you an enormous bunch of instructions which you can subindex again or not, etc. But the compiler itself did only two rounds, of which the last was
20:29 simplestofsuch: inexpensive; in other words, what the compiler did was encode one very expensive loop once, then add together some single-encoded values and encode them into a tiled address. Due to the data access routine I posted it can be done, but there is one catch: the intrinsic, as shown in the paper of the Cornell library's hindi genius, needs inexpensive lookup tables because the addends, or the second operand,
20:29 simplestofsuch: need to be decoded differently. Since the compiler would hide the latency anyway, you end up doing the set transitions in whatever batch with addressing; whether you want to add 4, 12, or 23 of them together is up to you. The simplest is two; not very hard is 20, you just shift the operands as the adder intrinsic is shown in the paper, but I knew this too, tbh. I figured out similar things.
20:57 diacibenuci: so all I tried to say is that from the moment of encoding the first big value, it's saner not to use that loop anymore, since if you permute two double-encoded values the curve is already a polynomially larger set; now if you do 20 of them you are already at millions of qubits, etc., without any performance loss, since the lookup table is just a small magic value. All you do is read the bits, then
20:57 diacibenuci: change 62, if present, to say 69, and if not, no access is done, and this goes very fast, a lot faster than the full dma or loop encoder. More performance is not possible; it's a military-grade scheme that I develop, but I do not want to work with the military if they kill the wrong people.
21:01 jenatali: Uh... glsl compiler warnings test is failing with access violation (segfault) and I don't repro it :(
21:04 DemiMarie: sima: Actually, there is another option: try to migrate the pages, and if that is not possible, either return an error to userspace or leave the pages on the CPU and try again later.
21:10 jenatali: Uh... and passed on re-run. That's not good
21:27 ledookyn: I started the rant by saying that you would not see that full encoder anymore in real code, because there is no point in it anymore, so you would not be able to understand any of the actual code if I was not speaking about it. And with that I try to finish the story too. You can change the depth of vision just by using a magic value, back and forth, by adding say 1000 bank values together
21:27 ledookyn: one after another (shifting their operands), and encoding that back to a smaller number, but on the fly with the logic in the magic value. So now a set has 20 times 64 bits in every cell, etc. That is real mathematics, though I must say I am not a very good mathematician; however, when every cell has 20 times 64 bits and is encoded in 3000 digits, it's arguably already using such a formula as
21:27 ledookyn: trilliontimestrilliontimes20trillions, i do not have access to such calculator yet, i am choosing something. so that is something like millions of qubits likely, whatever i do not know, i am tired now. Such men as you dwfreed should be fucking dropped to sharks or crocodiles food, fucking annoying shitbag you are.