What I Work On

Heads up: this needs some background in CS or programming.

Most humour is incidental. I have tried using small words and first principles, which sometimes looks funny accidentally. (The humour in the text, I mean, not in my work.)

If it isn’t clear, in systems, the word “memory” refers to RAM, not storage/HDD/SSD. The latter is called disk, a legacy catch-all name for storage devices. (Disk and memory are mass nouns in systems.)

Unified Virtual Memory

NVIDIA and other GPU vendors now support unified virtual memory. It took me eighteen months to notice that it isn’t a hollow technical name. (Compute unified device architecture says hello.) You might find my slides better but outdated.

There is a well-known CPU virtual memory, which int main() sees, and an obscure GPU virtual memory, which the CUDA kernel sees (which is actually not obvious: we used segmentation faults to find the granularity).

And, now, there is a unified virtual memory, if you ask for it, which both devices see.

So, a loadq 0x7fffabcd0000 loads the same 64 bits on either device, a storeq 0x7fffabcd0000 is visible to both devices, and a loadq 0xdeadbeef on either device kills a.out, modulo some coherence layer.

I find it a bit too logical to think of “unified virtual memory” as a proper noun and capitalize it: it is ultimately a unified CPU-GPU virtual memory.

Memory Oversubscription

We’ll get to the how of UVM later. Here’s the what.

The OS (the kernel, strictly speaking) swaps out excess memory to disk and loads it back on demand. If it isn’t clear, the “demand” is detected when an invalid page table entry, a deliberately planted trap, is encountered by the address translation hardware (MMU).

If you’re smart, you’ll notice the detection can only be done by the hardware: software is flexible but slow; that’s filesystems material. (open(2) says hello, movq %rdi, (%rsi) says bye.)

Now virtualizing memory, i.e. turning it into a help desk instead of a self-service outlet, permits the kernel to do “stuff” in the background, viz. swap memory. The conventional case for virtualization is isolation and portability; the behind-the-scenes opportunities are a distant third, but they are our USP.

Obviously, a non-virtual, physical memory, directly accessed by a.out cannot be unified with the GPU memory. Thus, UVM, like the Linux kernel, offers an arrogant, non-customizable, big-brother, help-desk, don’t-ask-me-how-I-did-that interface to the programmer. And, it uses hardware support, viz. page table traversals and page faults on either device, available on recent GPU architectures.

An unexpected possibility brought out by virtualization was memory oversubscription: the programmer can request N + 1 GB against a physical GPU memory (HBM) of N GB.
It’s there when a.out asks for it, maybe a bit late, but it’ll be there eventually. That movq will go through successfully, in finite time.

Don’t be surprised if it interferes with (paralyzes?) a different concurrent movq though, that too will terminate a short while later. That’s what memory thrashing looks like, even in the CPU world.

Blackboxes, Kernelspace, and Abracadabra

I work on the device driver which implements UVM. NVIDIA released it in 2022. You can find it here.

Let me repeat, I am big brother. I am watching. I am that help desk. I am not Snowden, I am God. But I like transparency.

All device drivers operate alongside the Linux (?) kernel, with dangerous levels of privilege and hardware access. And in this context, it is the driver’s job to “see” the physical device, and to mediate between the device (GPU) and the ordinary user/user process trying to use it.

The driver is a window to the device, but neither the device, nor the entity actually using the device. The former is a little physical blob that responds to electric shocks. The latter is something like a C file/ELF binary with seemingly human, abstract, and non-lethal actions like library function calls.

Sadly, there are no C functions/APIs to send electric shocks, and few safeguards to prevent the device driver from operating on the wrong device. You should never load precompiled, blackbox binaries, except maybe apt’s picks, and the driver prototypes I send.

We’ll briefly assume the CUDA APIs just work, until something goes wrong.

So, is UVM fast?

Turns out, it isn’t. An N + 2 GB workload is often 10-100 times slower than an N GB workload, or than the same workload on a larger GPU. And N + 4 is sometimes too slow to see the light of day; we don’t know what slowdown it incurs.

So, we need to dig deeper. How does UVM work?

Rather, how does the virtualization layer, or the driver, work? Rather, what does the driver do?

Virtualization

Suppose a device d1 ∈ {CPU, GPU1, GPU2, GPU3} tries reading a virtual memory address v. Mind you, a virtual address is a literal integer, 64 electronic bits, nothing fancy.

In the CPU world, v corresponds to some physical address p, and the programmer, or the userspace, or a.out, need not worry about p. The help desk is the x86/AMD64 ISA, the query is v, and the person at the desk is the hardware (not the OS). It fetches 8/16/32/64 bits from the RAM stick to the processor.

It might need some OS intervention, which may potentially kill you, the process, but you’ll never know, because you’ll be dead beef in the graveyard. It might need to bring back a swapped memory page, but you’ll never know, because the help desk is opaque/virtualized. Or it might succeed in one go, and even the OS may be kept in the dark: the OS is ultimately a client to the hardware, and is asleep until the hardware prods it with a problematic memory access. There are entry points to the OS, which the hardware uses, but which the OS cannot invoke itself, even though it is God.

But any valid query v will eventually bring you the datum on a platter, without you bothering with the details. Something between a nanosecond and tens of milliseconds, but in finite time. That is virtualization, that is the promise the OS and hardware make to the programmer. That is our hunky-dory Barbie world.

If this is new and confusing to you, celebrate, the operating system has succeeded.
The point of time-sharing, virtualization, and other 1960s/1970s ideas in computer science was to make programming simple, to offer black boxes and help desks, which you ordinarily need not worry about.

Big brother hai to mumkin hai (with big brother, it is possible). Presumably, what some computer scientists like about CS is that these parts are not an inexact science.

Virtualization has proven to be a no-brainer in modern computer systems. Anything beyond an ASIC or embedded device is unimaginable without layers of virtualization. An undergraduate operating systems course, if you haven’t noticed, is actually a study of these help desks: memory, disk, networking, IO devices, and least trivially, time.

This is despite the sometimes-large overhead, and the expected sub-optimalities between the layers.
Unsurprisingly, it sometimes helps when the process/programmer tells the platform what it plans to do. (Using the posix_fadvise(2), madvise(2), and sched(7) APIs, corresponding to disk, memory, and time, for example.)

UVM does the same, but is a step too far, sometimes catastrophically sub-optimal and uninformed (as a platform).
In some cases, this can be remedied using similar hints and requests.

Into the Black Box

The GPU and UVM worlds are roughly the same. But there is a new possibility: the logical datum at address v is on device d2 ≠ d1. This is a far fault.

And mind you, p2, the current physical address, need not equal p1, the physical address of the same byte(s) on device d1. Further, the pi’s are not ordinary 64-bit addresses or numbers: you also need to specify which device di you mean.

In fact, p1, p3, p4, … do not exist: the datum is not physically resident on any of those devices!

And that is the driver’s job: catch the far fault, migrate the datum to d1, fix up the page tables on the devices involved, and let the access retry.

Aside: the program doesn’t know it was stuck on line 550 or whatever for 200 μs. That is a secret between big brother, the hardware, his credit card, basically everyone other than the programmer and program, who are in the Disneyland of virtualization.

Why is it slow?

Far faults are slow almost by definition.

As an extremely elementary design principle, the hardware is fast: it can service 10^9 or so memory requests in a second. Or, it can add two integers in half a nanosecond. It follows the Unix philosophy:

Do one thing, and do it well.

– Siliconbhai Taiwanwala

(Yes, I typed this em-dash. Ironically, the LLM discourse taught me about them.)

Neither chai-pani nor chaiwala can bribe siliconbhai to do things differently or intelligently. He was fabricated in a foundry in 2019, and is not going to mend his ways. At best, the Taiwanwalas can change future generations of Siliconbhai.

The software is flexible and intelligent. It can smell 0xdeadbeef and turn you into 0xdeadmeat without running a page table walk and unsuccessfully trying to translate v = 0xdeadbeef. (Some software is Sanatani, some isn’t.) It can swap pages in and out, identify dfaulting, console it, service the fault, put a competing process to sleep if needed, etc. More generally (fundamentally?), software can be replaced atop the same platform.

But any realistic software operation is at least a few hundred instructions long, which is a few hundred nanoseconds.

The mills of the code grind slowly, but they grind exceeding fine.

– My learned friend, the software

Needless to say, fault servicing, as explained above, is a software prerogative. It is slow, between 50 and 500 μs.

Aside: The other considerations in the hardware-software tradeoff are (1) hardware draws power even if idling, and (2) circuit designs become immensely large and costly if logical operations, say sorting, are offloaded to them.
Infrequent operations, like disk accesses and sorting, are therefore software/OS operations. Frequent operations, like memory accesses, are assigned to the hardware. And they follow simple steps, to the extent possible. (Compare the complexities of a page table walk, and reading an offset from a fragmented file in an EXT4 disk partition.)
A less obvious decision is whether a 512-bit adder is a useful addition to the ISA—both the software (using 64-bit adders) and hardware (using sophomore knowledge) can efficiently perform the operation, and the question boils down to the sort of cryptographic operations to be performed.

Additionally, for context, a typical a.out accesses a few dozen files in the process lifetime, and a few dozen memory addresses every microsecond.
A typical GPU process has dozens to hundreds of compute units (CU/SM), dozens of warps in each CU, and 32 threads in each warp, each generating memory accesses by the nanosecond. Something like a minions version of Siliconbhai.

The UVM paradigm expects the single-threaded OS kernel on the CPU, designed for tasks like the former, to cater to the latter extreme. But it is not as difficult as it sounds, thanks to prefetching and physical page (PFN) sizes, for a “good” program.

However, with or without UVM, it is “difficult” for the GPU RAM (HBM, high bandwidth memory) to cater to that level of concurrency. This is why GPUs use HBM instead of DRAM/DDR/regular memory technologies.

If you take a step back, you’ll notice this is where the problem begins: HBM is costly, therefore HBM is small, therefore oversubscription helps. We would not have gone down this rabbit hole if we could plug a good ol’ 256-GB-or-so DRAM into a GPU; then it would be no worse than the CPU. Shared/DRAM physical memories work out only for gaming smartphones and “small” devices.

In fact, a plugged-in RAM stick, like on a CPU motherboard, simply cannot achieve the concurrency and bandwidth the GPU asks for. As far as I know, HBM is packaged right next to the compute die, not socketed like a DIMM.

Memory Thrashing

Furthermore, that credit card, that extra 2 GB we loaned, will come back to bite, some day.

Memory oversubscription is effected by evicting memory from the GPU when required. While a non-UVM environment does not support allocations exceeding N GB, a UVM platform quietly evicts (analogous to swapping out) memory when oversubscription is hit.

And, it might need to do so again, when the evicted datum is re-faulted. And, again. And again. We call this thrashing.

And, thrashing is when the device driver actually receives more faults than it can process, a full deluge proportional to the GPU compute bandwidth. Undersubscribed workloads stop throwing around spanners once their (N − 1) GB or whatever has been migrated to the device, because there are no evictions. (N + 1), as discussed earlier, might be terrible.

Aside: the driver does not actually receive that many faults: the interrupt interface and handler (APIC) cannot admit and forward that many. Nor can the GPU MMU. Each stage has some ratelimiting logic, either in software or in hardware. How all of this still “works out” is that most of this deluge is duplicate faults.

This textbook example should help: an LRU cache eviction policy, for a program accessing K + 1 array elements in a loop, always evicts the datum needed in the next step, and sees a lot of cache misses.
UVM, oversubscription, and thrashing is yet another tale of locality, evictions, and subpar intelligence.
You’ll find more in the filesystems world (the page cache), networks, and of course, microarchitecture.

This deluge of faults might well be a phenomenon worth studying: it is fully possible that the evictions triggered by a migration are as in-demand (hot pages) on the device as the migrated pages themselves. UVM with thrashing is a great mess, microarchitecturally and memory-wise.

What can we do?

The strongest tool in our armoury (as the programmer OR the device driver) is to remote-map data between the GPU and the large CPU memory: dfaulting is not dresident, but a special remote mapping, backed by a special hardware unit, bridges the two. v is now visible on two devices, not one.

And mind you, this is a hardware feature. The latency of such memory accesses is much lower than that of servicing a far-fault. Of course, it is much higher than the case where dfaulting = dresident.

Other options, which I do not work on, are rewriting CUDA programs with intelligent guesses about the UVM platform, or using a better eviction policy. (I strongly suspect the eviction policy is nearly optimal.)

How can we do it?

The question really is which exact pages to remote-map, when, how aggressively, et cetera.
How we do it is obviously by changing the driver. I’m skipping the technicalities.

How we decide to do these things, viz. how we improve the policy, is by profiling the prerogatives of the virtualization layer. There are no UVM profilers out there, per se. Existing NVIDIA profilers for non-UVM programs fall short, because the elephant in the room is oversubscription and thrashing, not HBM throughput and the 2015 roofline.

My group and I have done a fair bit of profiling, but I cannot describe it here. I hope it’ll be published shortly, and in the public domain. Cheers.

Sample Observations

Classified.


Acknowledgements

Converted to HTML using pandoc.