Heads up: this needs some background in CS or programming.
Most of the humour is incidental. I have tried to use small words and first principles, which sometimes ends up looking funny by accident. (The humour in the text, I mean, not in my work.)
If it isn’t clear, in systems, the word “memory” refers to RAM, and not storage/HDD/SSD. Those are called disk, a legacy catch-all name for storage devices. (Disk and memory are abstract nouns in systems.)
NVIDIA and other GPU vendors now support unified virtual memory. It took me eighteen months to notice that it isn’t a hollow technical name. (Compute unified device architecture says hello.) You might find my slides better, though they are outdated.
There is a well-known CPU unified virtual memory, which int main() sees, and an obscure GPU virtual memory, which the CUDA kernel sees (that one is actually not obvious: we used segmentation faults to find the granularity). And now there is a unified virtual memory, if you ask for it, which both devices see. So a loadq 0x7fffabcd0000 loads the same 64 bits on either device, a storeq 0x7fffabcd0000 is visible to both devices, and a loadq 0xdeadbeef on either device kills a.out, modulo some coherence layer.
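If you want to see that unified view in action, a managed allocation is the usual entry point. A minimal sketch, assuming a Pascal-or-later GPU and the CUDA runtime (the numbers are arbitrary):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *x) { *x += 1; }   // GPU dereferences the same VA

int main() {
    int *x = nullptr;
    // One allocation, one virtual address, visible on both devices.
    if (cudaMallocManaged(&x, sizeof(int)) != cudaSuccess) return 1;
    *x = 41;                        // CPU store through the unified VA
    increment<<<1, 1>>>(x);         // GPU load + store through the same VA
    cudaDeviceSynchronize();
    printf("%d\n", *x);             // prints 42: the CPU sees the GPU's store
    cudaFree(x);
    return 0;
}
```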
I find it a bit much to treat “unified virtual memory” as a proper noun and capitalize it: it is, ultimately, just a unified CPU-GPU virtual memory.
We’ll get to the how of UVM later. Here’s the what.
The OS (the kernel, not the whole OS) swaps out excess memory to the disk and loads it back on demand. If it isn’t clear, the “demand” is detected by the address translation hardware (the MMU) when it hits an invalid page table entry, a meticulously planted trap.
If you’re smart, you’ll notice this detection can only be done by the hardware: software is flexible but slow; per-access checks in software are filesystems material. (open(2) says hello, movq %rdi, (%rsi) says bye.)
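The CPU-side version of this is easy to watch from userspace. A small Linux-specific sketch (the 64 MB figure is arbitrary): touch freshly mmap’d anonymous memory and count the minor faults the kernel quietly serviced for you.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void) {
    const size_t len = 64UL << 20;                 /* 64 MB of anonymous memory */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    memset(p, 1, len);                             /* first touch of every page */
    getrusage(RUSAGE_SELF, &after);

    /* Roughly one minor fault per 4 KB page, all serviced behind our back. */
    printf("minor faults taken: %ld\n", after.ru_minflt - before.ru_minflt);
    munmap(p, len);
    return 0;
}
```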
Now, virtualizing memory, i.e. turning it into a help desk instead of a self-service outlet, permits the kernel to do “stuff” in the background, viz. swap memory. The conventional case for virtualization is isolation and portability; the behind-the-scenes opportunities are a distant third, but this is our USP.
Obviously, a non-virtual, physical memory, directly accessed by a.out, cannot be unified with the GPU memory. Thus UVM, like the Linux kernel, offers an arrogant, non-customizable, big-brother, help-desk, don’t-ask-me-how-I-did-that interface to the programmer. And it uses hardware support, viz. page table traversals and page faults on either device, available on recent GPU architectures.
An unexpected possibility brought out by virtualization was memory oversubscription: the programmer can request N + 1 GB against a physical GPU memory (HBM) of N GB.
It’s there when a.out asks for it, maybe a bit late, but
it’ll be there eventually. That movq will
go through successfully, in finite time.
Don’t be surprised if it interferes with (paralyzes?) a different concurrent movq, though; that too will terminate a short while later. That’s what memory thrashing looks like, even in the CPU world.
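Here is what asking for more than the HBM looks like. A minimal sketch, assuming a Linux system and a Pascal-or-later GPU whose driver allows managed oversubscription (it is not supported everywhere); the 2 GB of excess is arbitrary:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(char *p, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        p[i] = 1;                        // first touch far-faults the page onto the GPU
}

int main() {
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);               // roughly the N GB of HBM
    size_t n = free_b + (2UL << 30);                 // ask for ~N + 2 GB anyway
    char *p = nullptr;
    if (cudaMallocManaged(&p, n) != cudaSuccess) return 1;   // succeeds under UVM
    touch<<<1024, 256>>>(p, n);                      // every access completes, eventually
    printf("kernel: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaFree(p);
    return 0;
}
```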
I work on the device driver which implements UVM. NVIDIA released it in 2022. You can find it here.
Let me repeat, I am big brother. I am watching. I am that help desk. I am not Snowden, I am God. But I like transparency.
All device drivers operate alongside the Linux (?) kernel, with dangerous levels of privilege and hardware access. And in this context, it is the driver’s job to “see” the physical device, and to mediate between the device (GPU) and the ordinary user/user process trying to use it.
The driver is a window to the device, but neither the device, nor the entity actually using the device. The former is a little physical blob that responds to electric shocks. The latter is something like a C file/ELF binary with seemingly human, abstract, and non-lethal actions like library function calls.
Sadly, there are no C functions/APIs to send electric shocks, and few
safeguards to prevent the device driver from operating on the wrong
device. You should never load precompiled, blackbox binaries, except
maybe apt’s picks, and the driver prototypes I send.
We’ll briefly assume the CUDA APIs just work, until something goes wrong.
So, how does UVM work? Or, first: is UVM fast?
Turns out, it isn’t. An N + 2 GB workload is often 10-100 times slower than an N GB workload, or than the same workload on a larger GPU. And N + 4 is sometimes too slow to see the light of day; we don’t know what slowdown it incurs.
So, we need to dig deeper. How does UVM work?
Rather, how does the virtualization layer, or the driver, work? Rather, what does the driver do?
Suppose a device d1 ∈ {CPU, GPU1, GPU2, GPU3} tries reading a virtual memory address v. Mind you, a virtual address is a literal integer, 64 electronic bits, nothing fancy.
In the CPU world, v
corresponds to some physical address p, and the programmer, or the
userspace, or a.out, need not worry about p. The help desk is the x86/AMD64
ISA, the query is v, and the
person at the desk is the hardware (not the OS). It fetches 8/16/32/64
bits from the RAM stick to the processor.
It might need some OS intervention, which may potentially kill you,
the process, but you’ll never know, because you’ll be dead
beef in the graveyard. It might need to bring back a
swapped memory page, but you’ll never know, because the help desk is
opaque/virtualized. Or it might succeed in one go, and even the OS may
be kept in the dark: the OS is ultimately a client to the hardware, and
is asleep until the hardware prods it with a problematic memory access.
There are entry points into the OS which the hardware uses, but which the OS cannot invoke by itself, even though it is God.
But any valid query v will eventually bring you the datum on a platter, without you bothering with the details. Something between a nanosecond and tens of milliseconds, but in finite time. That is virtualization, that is the promise the OS and hardware make to the programmer. That is our hunky-dory Barbie world.
If this is new and confusing to you, celebrate, the operating system
has succeeded.
The point of time-sharing, virtualization, and other 1960s/1970s ideas
in computer science was to make programming simple, to offer black boxes
and help desks, which you ordinarily need not worry about.
Big brother hai to mumkin hai (with big brother around, anything is possible). Presumably, what some computer scientists like about CS is that these parts are not an inexact science.
Virtualization has proven to be a no-brainer in modern computer systems. Anything beyond an ASIC or embedded device is unimaginable without layers of virtualization. An undergraduate operating systems course, if you haven’t noticed, is actually a study of these help desks: memory, disk, networking, IO devices, and least trivially, time.
This is despite the sometimes-large overhead, and the expected
sub-optimalities between the layers.
Unsurprisingly, it sometimes helps when the process/programmer tells the platform what it plans to do. (Using the posix_fadvise(2), madvise(2), and sched(7) APIs, corresponding to disk, memory, and time, for example.)
UVM does the same, but it is a step too far out, sometimes catastrophically sub-optimal and uninformed (as a platform). In some cases, this can be remedied using similar hints and requests, as sketched below.
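For concreteness, here is what those hints look like on the UVM side. This is a sketch, not a recommendation: the buffer `data`, its size `bytes`, and the choice of device 0 are hypothetical, and the right advice depends entirely on the workload.

```cuda
#include <cuda_runtime.h>

// Hypothetical managed buffer `data` of `bytes` bytes, read-mostly on GPU 0.
void give_hints(float *data, size_t bytes) {
    int gpu = 0;
    // "I will mostly read this": the driver may replicate pages instead of
    // bouncing a single copy between devices.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, gpu);
    // "I need this there soon": migrate the working set up front, instead of
    // paying a far fault per page on first touch inside the kernel.
    cudaMemPrefetchAsync(data, bytes, gpu, 0);
}
```

These are the moral equivalents of madvise(2): advisory, ignorable, and only useful when the program actually knows its access pattern.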
The GPU and UVM worlds are roughly the same. But there is a new possibility: the logical datum at address v is on device d2 ≠ d1. This is a far fault.
And mind you, p2, the current physical address, need not equal p1, the physical address of the same byte(s) on device d1. Further, these pi are not ordinary 64-bit addresses or numbers: you also need to specify which device di you mean.
In fact, p1, p3, p4, … do not exist: the datum is not physically resident on any of those devices!
And that is the driver’s job: detect the far fault, migrate (or remote-map) the datum to d1, fix up the page tables on both devices, and let the access replay.
Aside: the program doesn’t know it was stuck on line 550 or whatever for 200 μs. That is a secret between big brother, the hardware, his credit card, basically everyone other than the programmer and program, who are in the Disneyland of virtualization.
Far faults are slow almost by definition.
As an extremely elementary design principle, the hardware is fast, and services 10⁹ or so memory requests in a second. Or, it can add two integers in half a nanosecond. It follows the Unix philosophy:
Do one thing, and do it well.
– Siliconbhai Taiwanwala
(Yes, I typed this em-dash. Ironically, the LLM discourse taught me about them.)
Neither chai-pani (a little grease money) nor the chaiwala (the tea-seller) can bribe Siliconbhai into doing things differently or intelligently. He was fabricated in a foundry in 2019, and is not going to mend his ways. At best, the Taiwanwalas can change future generations of Siliconbhai.
The software is flexible and intelligent. It can smell 0xdeadbeef and turn you into 0xdeadmeat without running a page table walk and unsuccessfully trying to translate v = 0xdeadbeef. (Some software is Sanatani, some isn’t.) It can swap pages in and out, identify dfaulting (the faulting device), console it, service the fault, put a competing process to sleep if needed, etc. More generally (fundamentally?), software can be replaced atop the same platform.
But any realistic software operation is at least a few hundred instructions long, which is a few hundred nanoseconds.
The mills of the code grind slowly, but they grind exceeding fine.
– My learned friend, the software
Needless to say, fault servicing, as explained above, is a software prerogative. It is slow, between 50 and 500 μs.
Aside: The other considerations in the hardware-software tradeoff are
(1) hardware draws power even if idling, and (2) circuit designs become
immensely large and costly if logical operations, say sorting, are
offloaded to them.
Infrequent operations, like disk accesses and sorting, are therefore
software/OS operations. Frequent operations, like memory accesses, are
assigned to the hardware. And they follow simple steps, to the extent
possible. (Compare the complexities of a page table walk, and reading an
offset from a fragmented file in an EXT4 disk partition.)
A less obvious decision is whether a 512-bit adder is a useful addition
to the ISA—both the software (using 64-bit adders) and hardware (using
sophomore knowledge) can efficiently perform the operation, and the
question boils down to the sort of cryptographic operations to be
performed.
Additionally, for context, a typical a.out accesses a
few dozen files in the process lifetime, and a few dozen memory
addresses every microsecond.
A typical GPU process has dozens to hundreds of compute units (CU/SM),
dozens of warps in each CU, and 32 threads in each warp, each generating
memory accesses by the nanosecond. Something like a minions version of
Siliconbhai.
The UVM paradigm expects the single-threaded OS kernel on the CPU, designed for tasks like the former, to cater to the latter extreme. But it is not as difficult as it sounds, thanks to prefetching and large physical page (PFN) sizes, for a “good” program.
However, with or without UVM, it is “difficult” for the GPU RAM (HBM, high bandwidth memory) to cater to that level of concurrency. This is why GPUs use HBM instead of DRAM/DDR/regular memory technologies.
If you take a step back, you’ll notice this is where the problem begins: HBM is costly, therefore HBM is small, therefore oversubscription helps. We would not have gone down this rabbit hole if we could plug a good ol’ 256-GB-or-so DRAM stick into a GPU; then it would be no worse than the CPU. Shared/DRAM physical memories work out only for gaming smartphones and “small” devices.
In fact, a plugged-in RAM stick, like on a CPU motherboard, simply cannot achieve the concurrency and bandwidth the GPU asks for. As far as I know, HBM is packaged together with the compute chip itself, not plugged into the board like a DIMM.
Furthermore, that credit card, that extra 2 GB we borrowed, will come back to bite us some day.
Memory oversubscription is effected by evicting memory from the GPU when required. While a non-UVM environment does not support allocations exceeding N GB, a UVM platform quietly evicts (analogous to swapping out) memory when oversubscription is hit.
And, it might need to do so again, when the evicted datum is re-faulted. And, again. And again. We call this thrashing.
And, thrashing is when the device driver actually receives more faults than it can process, a full deluge proportional to the GPU compute bandwidth. Undersubscribed workloads stop throwing around spanners once their (N − 1) GB or whatever has been migrated to the device, because there are no evictions. (N + 1), as discussed earlier, might be terrible.
Aside: the driver does not actually receive that many faults: the interrupt interface and handler (APIC) cannot admit and forward that many. Nor can the GPU MMU. Each stage has some ratelimiting logic, either in software or in hardware. How all of this still “works out” is that most of this deluge is duplicate faults.
This textbook example should help: with a cache of K slots and an LRU eviction policy, a program looping over K + 1 array elements always evicts the datum needed in the next step, and misses on every access.
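A toy simulation of that pattern, in plain C (K = 4 and three passes are arbitrary): with K slots and LRU eviction, the cyclic scan over K + 1 elements misses on every single access.

```c
#include <stdio.h>

#define K 4                               /* cache capacity, in elements */

int main(void) {
    int cache[K];                         /* cache[0] is LRU, cache[filled-1] is MRU */
    int filled = 0, misses = 0, accesses = 0;

    for (int pass = 0; pass < 3; pass++) {
        for (int x = 0; x < K + 1; x++) { /* the K + 1 element loop */
            accesses++;
            int hit = -1;
            for (int i = 0; i < filled; i++)
                if (cache[i] == x) { hit = i; break; }
            if (hit < 0) {                /* miss: evict the LRU slot, insert as MRU */
                misses++;
                if (filled == K) {
                    for (int i = 1; i < K; i++) cache[i - 1] = cache[i];
                    filled--;
                }
                cache[filled++] = x;
            } else {                      /* hit: promote to MRU (never happens here) */
                int v = cache[hit];
                for (int i = hit + 1; i < filled; i++) cache[i - 1] = cache[i];
                cache[filled - 1] = v;
            }
        }
    }
    printf("%d accesses, %d misses\n", accesses, misses);   /* 15 and 15 */
    return 0;
}
```

Replace the cache with HBM and an element with a page, and this is an oversubscribed UVM workload on loop.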
UVM, oversubscription, and thrashing are yet another tale of locality, evictions, and subpar intelligence. You’ll find more in the filesystems world (the page cache), in networks, and of course, in microarchitecture.
This deluge of faults might well be a phenomenon worth studying: it is fully possible that the evictions triggered by a migration are as in-demand (hot pages) on the device as the migrated pages themselves. UVM with thrashing is a great mess, microarchitecturally and memory-wise.
The strongest tool in our armoury (as the programmer OR the device driver) is to remote-map data between the GPU and the large CPU memory: dfaulting is not dresident, but a special remote mapping, backed by a special hardware unit, bridges the two. v is now visible on two devices, not one.
And mind you, this is a hardware feature. The latency of such memory accesses is much lower than that of servicing a far-fault. Of course, it is much higher than the case where dfaulting = dresident.
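From userspace, the closest approximation to this remote-mapping policy is another pair of advice flags. Again a hedged sketch: the buffer and the device index are hypothetical, and whether the driver honours the advice with a mapping (rather than a migration) depends on the hardware and interconnect.

```cuda
#include <cuda_runtime.h>

// Hypothetical: keep `data` physically resident in CPU memory, and let GPU 0
// reach it through a remote mapping instead of far-faulting pages back and forth.
void map_remotely(char *data, size_t bytes) {
    int gpu = 0;
    // Pin the preferred physical location to the CPU...
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    // ...and ask for GPU-side mappings to it, so GPU loads and stores travel
    // over the interconnect rather than triggering migrations.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, gpu);
}
```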
Other options, which I do not work on, are rewriting CUDA programs with intelligent guesses about the UVM platform, or using a better eviction policy. (I strongly suspect the eviction policy is nearly optimal.)
The question really is which exact pages to remote-map, when, how aggressively, etcetera.
How we do it is obviously by changing the driver. I’m skipping the
technicalities.
How we decide to do these things, viz. how we improve the policy, is by profiling the prerogatives of the virtualization layer. There are no UVM profilers per se out there. Existing NVIDIA profilers for non-UVM programs fall short, because the elephant in the room is oversubscription and thrashing, not HBM throughput and the 2015 roofline.
My group and I have done a fair bit of profiling, but I cannot describe it here. I hope it’ll be published shortly, and in the public domain. Cheers.
Classified.