# When Your LLM Trips the MMU: Page Faults, TLB Shootdowns, and the Hidden Virtual-Memory Tax of AI Inference

> A distinguished-architect deep dive into GPU virtual memory internals, MMU fault pipelines, TLB shootdown mechanics, page-table walks, Unified Memory/HMM coherence, ATS, and why page migration turns your p99 into a hardware problem nobody on the team budgeted for.

- Author: Hazem Ali
- Published: 2026-02-12
- Reading Time: 45 minutes
- Tags: LLMs, GPU, Virtual Memory, CUDA, Inference, MMU, Page Faults, Systems Architecture
- URL: https://drhazemali.com/blog/when-your-llm-trips-the-mmu
- Source: https://drhazemali.com

---

# When Your LLM Trips the MMU

## Page Faults, TLB Shootdowns, and the Hidden Virtual-Memory Tax of AI Inference

I want to talk about something most AI teams only discover after an incident, usually on a Friday night, usually at p99. Not quantization. Not batching. Not attention kernels. **Virtual memory.** The unglamorous machinery under everything we ship. The thing we assume is "solved" because we stopped teaching it in onboarding a decade ago, until we serve a large model, push utilization past eighty percent, oversubscribe memory by a margin someone called "fine in staging," and realize the GPU can page fault too.

Then latency stops being an application property and becomes a memory-management story. And memory management is not your team's specialty, because nobody's team has a virtual-memory specialist anymore.

I have spent over twenty years watching systems fail in ways that trace back to address translation. I watched it happen on SPARC. I watched it happen on x86 when NUMA surprised people who thought flat memory was real. I watched it happen on early GPGPUs when people assumed `cudaMemcpy` was the only contract they needed to honor. Now I am watching it happen at a scale that dwarfs all of those, because the working sets are hundreds of gigabytes and the latency contracts are single-digit milliseconds.
If you have been reading my Microsoft Tech Community writing, you already know the theme I keep returning to: in production, behavior is rarely "just the model." It is the runtime. This piece takes that same lens and aims it at the deepest layer we almost never discuss in AI architecture reviews: **the GPU MMU pipeline, page-table walks, TLB hierarchy and shootdown mechanics, page migration DMA engines, and the coherence contract between CPU and GPU address spaces.**

> We didn't change the model.
> You are right.
> You changed the residency story.

And residency is policy. Policy you did not write down. Policy the hardware is enforcing anyway.

---

## Part I: The Foundations Nobody Revisits

### Virtual memory is not an abstraction. It is a machine.

Every engineer learns virtual memory in school. Almost nobody retains the mechanical reality of it, because abstractions are comforting and page tables are not. Here is what actually happens when a processor issues a load instruction against a virtual address:

1. The virtual address is split into fields. On x86-64 with 4-level paging, that is 9 bits for PML4, 9 for PDPT, 9 for PD, 9 for PT, and 12 for the page offset. On a GPU, the split depends on the architecture, but the structure is analogous: a multi-level radix tree that maps virtual pages to physical frames.
2. The hardware first checks the **Translation Lookaside Buffer (TLB)**, a small, fully associative or set-associative cache of recent virtual-to-physical translations. On a modern CPU, the L1 dTLB holds 64 to 96 entries for 4 KB pages and a handful of entries for 2 MB or 1 GB huge pages. On a GPU, the TLB hierarchy is wider but shallower, because the GPU trades per-thread TLB depth for aggregate throughput across thousands of concurrent threads.
3. On a TLB miss, the hardware initiates a **page-table walk**. On x86, this is done by a dedicated hardware walker that issues memory reads against the page-table structures in physical memory.
Each level of the walk is a dependent load: you cannot read the PD entry until you have resolved the PDPT entry, because the PDPT entry tells you the physical address of the PD. Four levels means four dependent memory reads in the worst case. If any of those reads misses in the data cache, you are waiting on DRAM latency for each level.

4. If the walk completes and the page is present and accessible, the translation is installed in the TLB and the original instruction retires. If the page is not present (the Present bit is clear), or the access violates permissions, the hardware raises a **page fault**, and control transfers to the operating system.

That is the CPU story. The GPU story is structurally similar but operationally different in ways that matter enormously for AI inference.

### The GPU MMU: same concept, different physics

A GPU does not have a single thread of execution waiting on a translation. It has tens of thousands. NVIDIA's Hopper architecture, for example, can have up to 2048 concurrent threads per SM, and there are over 100 SMs on the H100. That is hundreds of thousands of threads that could, in principle, issue memory accesses simultaneously.

The GPU MMU is designed for this. It has a TLB hierarchy that is shared across warps within an SM, and the page-table walker can service multiple outstanding walks. But the fundamental contract is the same: if the translation is not in the TLB and the page-table walk reveals a non-resident page, the GPU faults.

Here is where it gets interesting. A CPU page fault stalls one thread, or one core in the worst case. A GPU page fault can stall an entire warp of 32 threads, and depending on the fault handling mechanism, it can stall scheduling on the SM while the fault is in flight. If enough SMs fault simultaneously, you have effectively stalled the entire GPU.

This is not a theoretical concern. This is what happens when you oversubscribe GPU memory and the working set shifts.
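Before putting costs on the walk, it helps to make the mechanics concrete. Here is a minimal sketch of the 9/9/9/9/12 field split from Part I. This is the x86-64 layout; GPU page-table layouts are architecture-specific and not documented at this level of detail, so x86-64 stands in as the worked example:

```python
def split_x86_64_va(va: int):
    """Decompose a 48-bit x86-64 virtual address into 4-level paging indices."""
    offset = va & 0xFFF            # bits 0-11: byte offset within a 4 KB page
    pt     = (va >> 12) & 0x1FF    # bits 12-20: page-table index
    pd     = (va >> 21) & 0x1FF    # bits 21-29: page-directory index
    pdpt   = (va >> 30) & 0x1FF    # bits 30-38: PDPT index
    pml4   = (va >> 39) & 0x1FF    # bits 39-47: PML4 index
    return pml4, pdpt, pd, pt, offset

# Each index selects an entry at one level of the radix tree. A cold walk is
# four dependent memory reads, because the entry at level N holds the physical
# address of the table at level N+1.
print(split_x86_64_va(0x0000_7FFF_DEAD_B000))
```

Two addresses within the same 4 KB page share all four indices and differ only in the offset, which is exactly why one TLB entry covers the whole page.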
### What a GPU page-table walk actually costs

On a CPU, a four-level page-table walk costs roughly four times the DRAM latency in the worst case, which is around 200 to 400 nanoseconds depending on your memory subsystem. In practice, page-table entries are often cached in the L1 and L2, so walks complete faster.

On a GPU, the page-table structures live in device memory, or in Unified Memory configurations, potentially in system memory. Device memory on an H100 has roughly 300 to 400 nanoseconds of access latency from the perspective of the memory controller. A multi-level walk incurs that latency per level. But the GPU has more tolerance for latency because it hides it through massive thread-level parallelism, assuming other warps are ready to execute.

The problem is when the walk terminates in a fault. Now you are not hiding latency with other warps. You are waiting for a fault to propagate through the GPU's fault buffer, get picked up by the CPU-side driver, get processed, potentially trigger a page migration across PCIe or NVLink, and then get resolved with a TLB invalidation and retry. That is not nanoseconds. That is microseconds to milliseconds. And that is a different category of problem entirely.

### The TLB hierarchy in detail

On NVIDIA Hopper, the TLB is structured in at least two levels:

- **L1 TLB**: per-SM, services translation requests from the load/store units. Covers a limited number of pages. A miss here triggers a walk or an L2 TLB lookup.
- **L2 TLB**: shared across a GPC (Graphics Processing Cluster), larger capacity. A miss here triggers a full page-table walk in memory.

The critical insight is that the L1 TLB is small relative to the working set of a large model. A 70B parameter model in FP16 occupies roughly 140 GB. At 4 KB page granularity, that is about 36 million pages. At 2 MB granularity, if huge pages are used, that is about 70,000 pages. The L1 TLB cannot hold even a fraction of those translations.
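The coverage arithmetic is worth doing explicitly, because the numbers are the whole argument. A quick sketch (illustrative: real allocators mix page sizes, and real TLB capacities vary by architecture and are not fully public):

```python
GiB = 1024**3

def translations_needed(footprint_bytes: int, page_bytes: int) -> int:
    """One page-table entry, and potentially one TLB entry, per page."""
    return -(-footprint_bytes // page_bytes)   # ceiling division

model_bytes = 140 * GiB                                  # ~70B parameters in FP16
small = translations_needed(model_bytes, 4 * 1024)       # 4 KB base pages
huge  = translations_needed(model_bytes, 2 * 1024**2)    # 2 MB huge pages

print(f"4 KB pages: {small:>12,} translations")          # ~36.7 million
print(f"2 MB pages: {huge:>12,} translations")           # ~72 thousand
```

Either count dwarfs a TLB that holds on the order of tens to hundreds of entries per level, which is why residency alone does not guarantee translation hits.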
This means that during inference, the GPU is almost continuously walking page tables for the weight tensors, unless the access pattern has enough temporal locality that the TLB retains the hot translations. For decode steps, where attention accesses are relatively predictable, this is often fine. For prefill, where the model touches large swaths of the parameter space, TLB pressure can be significant and is almost never measured.

---

## Part II: When the GPU Faults

### The anatomy of a GPU page fault

When a GPU thread accesses a virtual address and the page-table walk reveals that the page is not resident, the following sequence occurs. I am reconstructing this from NVIDIA's published documentation, research papers, and profiling observations gathered over years of production systems work:

1. **Fault detection**: The MMU detects that the page is not present or has insufficient permissions. The faulting access is recorded.
2. **Fault buffer write**: The fault information, including the faulting virtual address, the type of access, and the SM identity, is written into a **GPU-side fault buffer**. This is a hardware structure in device memory that accumulates faults.
3. **Fault notification**: The GPU signals the CPU-side driver that faults are pending. This typically happens via an interrupt or a polling mechanism, depending on the driver's configuration.
4. **Fault retrieval**: The CPU-side NVIDIA driver reads the fault buffer, de-duplicates faults (multiple threads may have faulted on the same page), and determines the set of pages that need to be made resident.
5. **Page resolution**: For each faulted page, the driver determines where the page currently lives, whether in host memory, on another GPU, or nowhere, and initiates the appropriate action: allocation, migration, or mapping.
6. **DMA transfer**: If migration is required, the driver programs the GPU's copy engine (CE) or uses GPUDirect mechanisms to move the page data.
For a 4 KB page over PCIe Gen5, the raw transfer time is small, under a microsecond, but the setup overhead including DMA descriptor programming, IOMMU translation, and synchronization can dominate. For a 64 KB or 2 MB migration unit, the data transfer is larger but the per-page overhead is amortized.

7. **Page-table update**: The driver updates the GPU page table to map the newly resident page, ensuring the translation is correct.
8. **TLB invalidation**: The old, invalid translation must be purged from any TLB that may have cached it. On a GPU, this is a **TLB shootdown**, analogous to the CPU's inter-processor TLB invalidation but scoped to the GPU's SM fabric.
9. **Fault replay**: The GPU retries the faulting instructions. If the page is now resident, execution proceeds. If not, due to a race or a cascading fault, the cycle repeats.

This entire pipeline, from fault detection to replay, takes on the order of **10 to 100+ microseconds** depending on whether migration is involved, how far the data has to travel, and how loaded the fault-handling path is. Compare that to a normal memory access latency of 300 to 400 nanoseconds. The fault path is tens to hundreds of times slower, and when resolution stretches into milliseconds, thousands of times slower.

### Why 10 microseconds is not "fine"

I hear this dismissal regularly. "Ten microseconds? That is nothing. Our step time is 20 milliseconds."

If you are generating tokens at 50 tokens per second, each decode step takes about 20 milliseconds. A 10-microsecond fault seems negligible, roughly 0.05% of the step time.

But faults do not arrive uniformly. They cluster. When the working set shifts, when a new KV cache segment is touched for the first time, when a batch rearrangement triggers a different access pattern, you get bursts of faults. Twenty faults in a burst is 200 microseconds. Two hundred faults is 2 milliseconds, which is 10% of your step budget.
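The burst arithmetic is trivial, but it is worth writing down because it is the number you put in front of the team. A sketch, treating 10 microseconds per fault as a round illustrative figure rather than a measured constant:

```python
def step_time_ms(compute_ms: float, faults: int, fault_us: float) -> float:
    """Decode-step latency when a burst of page faults lands inside one step."""
    return compute_ms + faults * fault_us / 1000.0

BUDGET_MS = 20.0   # ~50 tokens/s decode
for burst in (0, 1, 20, 200):
    t = step_time_ms(BUDGET_MS, burst, 10.0)
    overhead = (t - BUDGET_MS) / BUDGET_MS
    print(f"{burst:4d} faults -> {t:6.2f} ms ({overhead:.1%} of budget)")
```

One fault is invisible. Two hundred faults, arriving together because they share a cause, is a tenth of the budget gone before any kernel optimization can help.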
And during those faults, the SMs that are stalled cannot make progress on other work, because the warp scheduler cannot issue instructions from warps that are blocked on memory. This is why faults destroy p99 and not p50. At p50, faults are rare or absent. At p99, you are sampling the moments when the working set shifted and the fault machinery fully engaged.

I have seen teams spend months optimizing attention kernels to save 0.5 milliseconds, while ignoring a page-fault tail that adds 5 milliseconds at p99. That is a strategic failure, not a technical one.

### The TLB shootdown problem

TLB shootdowns on GPUs are less discussed than on CPUs, but they are arguably more consequential.

On a CPU, a TLB shootdown requires sending an inter-processor interrupt (IPI) to every core that may have cached the invalidated translation, waiting for each core to flush its TLB entry, and then acknowledging the invalidation. On a busy system with dozens of cores, this can take multiple microseconds and stall the initiating core while it waits for acknowledgments.

On a GPU, the shootdown must reach every SM that may have cached the translation. With over 100 SMs on an H100, this is a broadcast invalidation across the GPU's internal fabric. The invalidation itself is fast because the GPU has dedicated hardware for it, but the consequence is that any SM that had the translation cached must now re-walk the page table on the next access to that page. If many SMs were actively using that translation, for example because the page contained a hot weight tensor, the re-walk storm can temporarily saturate the page-table walker.

In AI inference, this manifests as a brief but sharp drop in SM utilization immediately after a page migration. Nsight Systems will show it as a dip in the "Warps Active" counter that correlates with migration events. Most teams see these dips and attribute them to "scheduling jitter." They are not jitter. They are shootdowns.
---

## Part III: The Memory Hierarchy Nobody Draws on the Whiteboard

### It is not "GPU memory" vs "CPU memory" anymore

The old model was binary: data is on the GPU or it is not. That model died with Unified Memory, but the mental model persists in most teams' architecture diagrams. The actual memory hierarchy in a modern inference node is deeper and more treacherous than people realize:

1. **GPU registers**: per-thread, the fastest storage available. Not pageable, not addressable beyond the thread.
2. **GPU shared memory and L1 cache**: per-SM, nanosecond access. SRAM, explicitly managed or cached. Not pageable.
3. **GPU L2 cache**: shared across the entire GPU, tens of nanoseconds access latency. Not directly pageable, but it caches page-table entries and data from device memory. Its behavior under pressure directly affects page-table walk performance.
4. **GPU device memory (HBM)**: the main "VRAM." Hundreds of nanoseconds access latency. This is what people mean when they say "GPU memory." Pages are resident here in the common case.
5. **NVLink-connected peer GPU memory**: accessible via NVLink at lower latency than PCIe but higher than local HBM. Pages can be migrated here in multi-GPU setups. Peer access without migration is also possible but incurs NVLink latency on every access.
6. **CPU system memory (DRAM)**: accessible via PCIe or the C2C link on Grace Hopper. Much higher latency from the GPU's perspective. Pages can be migrated here when the GPU is under memory pressure, or they can start here and fault-in on demand.
7. **NVMe and SSD storage**: not directly pageable by the GPU, but the CPU can orchestrate loads into system memory, which then becomes available through the Unified Memory system. Some experimental systems use GPUDirect Storage to bypass the CPU, but this is not part of the standard inference stack.

Unified Memory and HMM blur the boundary between levels 4, 5, and 6.
A virtual address might map to any of those physical locations, and the system migrates data between them based on access patterns and policy. The problem is that each level transition has a different cost, and the cost is not just latency but also control-plane overhead: page-table updates, TLB shootdowns, DMA programming, driver work. AI inference workloads routinely touch data at levels 4, 5, and 6, sometimes within the same decode step.

### The page-size question and why it matters more than you think

Most GPU virtual memory systems use a base page size of 4 KB, with support for 64 KB and 2 MB "huge" pages. This is not a tuning knob. It is an architectural decision with profound implications.

**TLB coverage**: A 4 KB page requires one TLB entry per 4 KB of address space. A 2 MB page requires one entry per 2 MB. For a 140 GB model, you need roughly 36 million TLB entries at 4 KB but only about 70,000 at 2 MB. That is the difference between constant TLB misses and a working TLB, between page-table walks on every tensor access and translation hits.

**Migration granularity**: When a page fault triggers a migration, the entire page is moved. At 4 KB, you migrate 4 KB, which means many small DMA transfers with high per-transfer overhead. At 2 MB, you migrate 2 MB, which means fewer transfers but more data moved per fault. For AI workloads where access patterns are often sequential, scanning through weight matrices row by row, 2 MB pages can be dramatically better because one fault resolves a large contiguous region that will be accessed soon anyway.

**Internal fragmentation**: Larger pages waste more space when the allocation does not align to the page boundary. For large tensor allocations, this is rarely a problem because tensors are typically megabytes to gigabytes. For small metadata allocations mixed into the same address space, it can waste significant memory.
**Page-table memory consumption**: Larger pages mean fewer page-table entries, which means the page-table structures themselves consume less memory. At 4 KB granularity, the page tables for 140 GB of address space can consume hundreds of megabytes of device memory. That is memory you cannot use for your model.

In practice, CUDA's memory allocator uses a mix of page sizes, and the Unified Memory system can use different granularities for migration. If you are not explicitly managing this, the system is making choices for you, and those choices directly impact your TLB hit rate and fault behavior.

### PCIe is your real bottleneck, not HBM bandwidth

When a page migrates from CPU to GPU over PCIe Gen5 x16, the theoretical bandwidth is about 63 GB/s in each direction. That sounds fast. It is not, when you account for the realities of the protocol:

**Protocol overhead**: PCIe is a packetized protocol. Each transaction has header overhead, and small transfers do not saturate the link. A 4 KB page does not move as a single packet: it is split into TLPs (Transaction Layer Packets) whose payloads are typically a few hundred bytes, each carrying 16 to 24 bytes of header, plus flow-control credits and acknowledgments. Effective throughput for small transfers can be 30 to 50 percent of theoretical peak.

**DMA engine scheduling**: The GPU's copy engine must be programmed for each transfer. If there are many small transfers because many 4 KB pages faulted, the CE scheduling overhead can dominate the total time. Batching helps, but batching requires the driver to accumulate multiple faults before initiating transfers, which trades latency for throughput.

**Link contention**: The PCIe link is shared with other traffic. NVMe storage I/O, network traffic if not on a separate fabric, CPU-to-GPU command submission, telemetry reads. Page migration traffic competes with all of this.

**Bidirectional interference**: If the GPU is simultaneously migrating pages in (HtoD) and evicting pages out (DtoH), the PCIe link carries traffic in both directions.
Gen5 supports full-duplex at the protocol level, but the endpoints may have asymmetric capabilities or internal contention.

NVLink changes the economics: NVLink 4.0 on Hopper provides 900 GB/s bidirectional bandwidth between GPUs, and Grace Hopper's C2C link provides 900 GB/s between CPU and GPU. At those bandwidths, migration latency is dominated by setup overhead, not data transfer time. This is one of the fundamental reasons NVIDIA's Grace Hopper architecture matters: it does not just change the bandwidth story. It changes the page-migration economics.

---

## Part IV: The LLM Inference Pressure Cooker

### Why this is the worst-case workload for virtual memory

I have worked on storage systems, databases, HPC simulations, and real-time trading engines. None of them stress virtual memory the way LLM inference does. Here is why, and I want to be specific about the mechanisms:

**The working set is enormous and non-uniform.** A 70B parameter model in FP16 occupies roughly 140 GB of weight tensors. During prefill, the model touches nearly all of them as it processes the input prompt through every layer. During decode, the access pattern narrows to the layers being computed for the current token, but the KV cache grows with every generated token. The total addressable footprint can easily exceed physical device memory, especially with batched requests and long contexts.

**The access pattern has phase transitions.** Prefill and decode are fundamentally different memory access patterns. Prefill is bandwidth-hungry and largely sequential: the model streams through weight matrices for the entire prompt. Decode is latency-sensitive and more scattered: attention must access KV cache entries that correspond to previous tokens, and those entries may be spread across the address space. The transition between phases can trigger a wave of TLB misses and potential page faults as the hot working set changes abruptly.
**Concurrency multiplies everything.** Continuous batching means multiple requests at different stages share the GPU simultaneously. Each request has its own KV cache region, its own position in the attention computation, its own hot set of pages. The union of all these working sets is what the TLB and page tables must cover. When a new request arrives or an old one completes, the union changes, and the TLB's cached translations may become irrelevant.

**Speculative decoding adds unpredictability.** When a draft model generates speculative tokens, the verification step may accept all of them, which is the happy path, or reject some, which means the GPU did work and accessed memory for tokens that are being discarded. Those memory accesses may have triggered page faults for pages that will not be needed again, wasting fault-handling budget.

**Multi-tenant deployments compound the problem.** LoRA adapters, per-tenant KV caches, varying context lengths, and different model configurations create a non-stationary working set that defies simple prefetch heuristics. What works for tenant A's 4K-context request does not work for tenant B's 32K-context request running on the same GPU at the same time.

### The anatomy of an inference latency spike

Let me walk through a concrete scenario. This is composited from real incidents I have observed and helped diagnose, not hypothetical:

**Setup**: 70B model on 4x H100 with tensor parallelism. Unified Memory enabled for KV cache overflow. Continuous batching with up to 64 concurrent requests. System tuned for throughput, with memory utilization running at 88 percent.

**Steady state at p50**: Everything fits. Weights are resident on each GPU. KV caches are in device memory. TLBs are warm because the access patterns are repetitive. Decode latency is 18 ms per token. The dashboard is green. Everyone is happy.

**Load increases, p95 territory**: Total KV cache footprint approaches device memory capacity.
The allocator starts placing new KV cache pages in address regions that were recently freed and may not have been prefetched. Occasional TLB misses on KV cache access, resolved by page-table walks but no faults because the pages are still resident. Minor increase in memory subsystem pressure. Decode latency creeps to 22 ms. Alerts do not fire. Nobody investigates.

**Threshold crossing, p99**: A long-context request arrives, maybe a 32K-token prompt from a tenant who just increased their context window. Its KV cache does not fit in the remaining device memory. The Unified Memory driver begins migrating cold KV cache pages from other requests to host memory to make room. Now, when the attention kernel accesses those migrated pages, it faults. Each fault takes 30 to 50 microseconds to resolve, including PCIe migration. The attention computation for this request stalls, and because continuous batching shares the SM pool, other requests' compute is also delayed. Decode latency spikes to 80 ms. The SLO is violated.

**Cascade at p99.9**: While the GPU is handling faults from the long-context request, the fault-handling path becomes saturated. The fault buffer fills up. Other requests that happen to touch recently-migrated pages also begin faulting. The PCIe link becomes congested with bidirectional migration traffic, some pages moving in and others being evicted out. CPU-side fault processing consumes cores that would otherwise be used for request scheduling and tokenization. Decode latency hits 200 ms for multiple requests. The monitoring system fires alerts. Someone gets paged.

The incident review will conclude that the model "became slow" and the team will "investigate the scheduler." The model did not become slow. The residency policy failed under load. Nobody instrumented the control plane that failed.
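The threshold behavior in this incident, fine at 88 percent and collapsing a few points later, is the signature of a saturating service loop. A toy M/M/1 queueing sketch makes the nonlinearity visible; the arrival rates and the 40-microsecond service time are illustrative assumptions, not measurements of any driver:

```python
def mean_fault_latency_us(faults_per_ms: float, service_us: float) -> float:
    """M/M/1 sojourn time for the fault-handling path: queueing wait + service."""
    arrivals_per_us = faults_per_ms / 1000.0
    utilization = arrivals_per_us * service_us   # rho = lambda / mu
    if utilization >= 1.0:
        return float("inf")                      # arrivals outpace service: unbounded backlog
    return service_us / (1.0 - utilization)

for rate in (5, 15, 24, 25):                     # faults per millisecond
    lat = mean_fault_latency_us(rate, 40.0)
    print(f"{rate:2d} faults/ms -> {lat:8.1f} us per fault")
```

Nothing about the workload changes between 24 and 25 faults per millisecond except crossing the service capacity of the fault path. That is why the incident looks like a cliff, not a slope, and why p50 dashboards stay green until the moment they do not.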
---

## Part V: The Math Nobody Writes Down

### Modeling fault probability per step

If you want a simple way to communicate this to leadership, I like to frame it as a tail-risk multiplier. Not because the math is precise, but because it gives people the right intuition about what they are gambling on.

Let:

- $t_k$ = your kernel compute time for a decode step
- $t_f$ = the mean service time of a far page fault (including migration)
- $\lambda$ = the rate of far faults per step (or per token)
- $P$ = the probability a step experiences at least one far fault

A coarse model for the probability of at least one fault per step, assuming faults arrive as a Poisson process:

$$
P \approx 1 - e^{-\lambda}
$$

Expected step time under this model, since the expected number of faults per step is $\lambda$:

$$
\mathbb{E}[T] \approx t_k + \lambda \cdot t_f
$$

For small $\lambda$, $P \approx \lambda$, so the expected overhead and the "probability times cost" framing give nearly the same number; the distinction only matters once faults become frequent.

### Why the Poisson assumption breaks at the tail

The p99 is where it gets ugly, because faults cluster. Under certain access patterns, faults are not independent events. They come in bursts when the working set crosses a residency boundary. The Poisson model, which assumes independent arrivals, underestimates the tail.

A better model for burst faults uses a compound Poisson process where the number of faults per burst follows a geometric or log-normal distribution. But the precise statistical model matters less than the intuition: at p99, you are sampling the moments when a batch transition, a new request arrival, or a cache eviction triggered a cascade of faults that the Poisson model never predicted.

This is why you can see a system that looks fine at p50 and p95, then collapses at p99 without any obvious "compute saturation." Every metric that teams normally track, GPU utilization, memory bandwidth, SM occupancy, looks fine at the median. The tail is not a compute problem. It is a residency problem. The compute is not the bottleneck. The residency control loop is.
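The gap between the Poisson model and clustered reality is easy to demonstrate numerically. Here is a Monte Carlo sketch of the compound-Poisson idea, assuming geometric burst sizes; the distribution choice and every constant are illustrative assumptions, not fitted to any system:

```python
import math
import random

def poisson_sample(lam: float, rng: random.Random) -> int:
    """Knuth's algorithm: count uniform draws until their product drops below e^-lam."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return k - 1

def simulate_steps(n, fault_rate, burst_mean, compute_us, fault_us, rng):
    """Per-step latency with faults arriving in bursts of geometric size.
    burst_mean=1 reduces to plain Poisson; larger values keep the same mean
    fault rate but concentrate faults into fewer, bigger bursts."""
    times = []
    for _ in range(n):
        faults = 0
        for _ in range(poisson_sample(fault_rate / burst_mean, rng)):
            size = 1
            while rng.random() > 1.0 / burst_mean:   # geometric, mean burst_mean
                size += 1
            faults += size
        times.append(compute_us + faults * fault_us)
    return times

def p99(xs):
    return sorted(xs)[int(len(xs) * 0.99)]

rng = random.Random(7)
smooth = simulate_steps(20000, 0.5, 1, 20000.0, 40.0, rng)    # independent arrivals
bursty = simulate_steps(20000, 0.5, 20, 20000.0, 40.0, rng)   # same mean, clustered

print(f"p99 smooth: {p99(smooth):.0f} us")
print(f"p99 bursty: {p99(bursty):.0f} us")
```

Both runs have the same mean fault rate, so the averages are indistinguishable; only the tail moves. That is precisely the shape of the production failure: p50 identical, p99 materially worse.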
### Fault amplification in multi-GPU settings

In tensor-parallel configurations, a page fault on one GPU does not just stall that GPU. It stalls the collective communication that depends on that GPU's output. If GPU 2 faults during its shard of the attention computation, GPUs 0, 1, and 3 are waiting at the AllReduce barrier. One fault becomes a system-wide stall.

The effective latency of a step in a tensor-parallel setup is dominated by the slowest GPU:

$$
T_{\text{effective}} = \max_{i \in \text{GPUs}} \left( t_{\text{compute},i} + n_{\text{faults},i} \cdot t_f \right)
$$

The max function is cruel. The probability that at least one GPU out of four faults on a given step is much higher than the probability that any specific GPU faults. If each GPU has a 2% chance of faulting on a given step, the probability that at least one faults is roughly 8%. At p99, multi-GPU setups amplify fault visibility by a factor roughly proportional to the GPU count.

---

## Part VI: The Coherence Problem

### Unified Memory, HMM, and who owns the page table

There is a subtle but architecturally profound shift happening in the ecosystem: moving from a world where the GPU driver maintains its own page tables as a private structure, toward a world where the system aims to unify address translation across CPU and GPU under a single coherence domain.

**NVIDIA's trajectory**: CUDA Unified Memory started as a driver-managed system where the GPU had its own page tables, maintained by the NVIDIA driver, providing a mirror of the CPU's virtual address space. With HMM support in newer drivers and Linux kernels, the GPU can participate more directly in the kernel's memory management subsystem, using the same `struct page` and VMA primitives. Grace Hopper takes this further with Address Translation Services (ATS), where the GPU can directly use the CPU's page tables via a coherent interconnect, eliminating the need for mirroring entirely.
ATS means the GPU sends translation requests to the CPU's IOMMU, which walks the CPU's page tables on the GPU's behalf.

**AMD's trajectory**: ROCm's HIP Unified Memory explicitly states that a device access to non-resident memory triggers a page fault and a request for the page from host or another device, followed by unmap, transfer, and map on the destination. The XNACK hardware feature on AMD GPUs enables GPU-side page fault handling, but it has historically been a significant performance variable. Some workloads run faster with XNACK off, meaning no page faults allowed and explicit management required, and others benefit from XNACK on with transparent migration. Knowing which regime your workload falls into is not optional.

**The industry convergence**: Both major GPU vendors are heading toward the same destination. The GPU is becoming a first-class participant in the operating system's virtual memory system. This means your inference architecture must start treating page migration and address translation as first-class performance variables, not as invisible plumbing that "the driver handles."

### The coherence tax you are already paying

Even when pages are resident and no faults occur, the coherence protocol between CPU and GPU imposes costs that most teams never measure:

**Snooping**: On systems with hardware coherence, like Grace Hopper and AMD MI300, the GPU's memory accesses may trigger snoop requests to the CPU's cache hierarchy, ensuring that both CPU and GPU see consistent data. These snoops consume interconnect bandwidth and can stall GPU accesses if the CPU's cache hierarchy is slow to respond. In AI inference, where the CPU is often busy with tokenization, scheduling, and fault handling, snoop latency can be non-trivial.

**Memory ordering**: CUDA's memory model defines specific ordering guarantees for different scopes: thread, block, device, and system.
System-scope atomics on Unified Memory must ensure that both CPU and GPU see updates in a consistent order. On software-coherent systems, this may require explicit cache flushes and invalidations. On hardware-coherent systems, the hardware enforces it, but at a latency cost that depends on the interconnect and the cache state.

**False sharing at page granularity**: If the CPU and GPU access different data within the same page, the page can ping-pong between host and device memory. This is the virtual-memory equivalent of CPU cache-line false sharing, but at page granularity the cost is orders of magnitude higher. A cache-line bounce costs hundreds of nanoseconds. A page bounce costs tens of microseconds. In mixed workloads where the CPU writes metadata and the GPU reads tensor data from nearby addresses, page-level false sharing can be a silent killer.

### The subtle trap: atomics as fault generators

CUDA's Unified Memory documentation includes a detail that should make any distinguished architect pause: on software-coherent systems, atomic accesses to Unified Memory may incur page faults and significant latencies, while hardware-coherent systems behave differently. That is a polite way of saying: if you build synchronization or shared-state patterns across CPU and GPU and you assume they are "just memory operations," you may be building a fault generator.

This matters in production systems where teams routinely build:

- shared counters for request scheduling and load balancing
- host-driven memory metadata updates for dynamic batching
- mixed CPU/GPU access to the same scheduling structures in tight loops
- flag-based synchronization between CPU preprocessing and GPU compute

Your control plane can become your fault plane. And the worst part is that it will work perfectly in testing, where the load is low and the pages are resident, and fail catastrophically in production, where contention triggers the migration ping-pong.
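The gap between a cache-line bounce and a page bounce is worth quantifying, because it is the difference between a tolerable control plane and a fault generator. A back-of-envelope sketch; both cost constants are order-of-magnitude illustrations, not measurements of any specific part:

```python
CACHE_LINE_BOUNCE_NS = 300   # CPU cache-line ping-pong, order of magnitude
PAGE_BOUNCE_US = 30          # Unified Memory page migration round trip, order of magnitude

def shared_flag_stall_fraction(updates_per_sec: float):
    """Fraction of wall-clock time lost to a CPU/GPU shared flag, at cache-line
    coherence granularity versus page-migration granularity."""
    line_cost = updates_per_sec * CACHE_LINE_BOUNCE_NS * 1e-9
    page_cost = updates_per_sec * PAGE_BOUNCE_US * 1e-6
    return line_cost, page_cost

line, page = shared_flag_stall_fraction(1000.0)   # a 1 kHz scheduling flag
print(f"cache-line granularity: {line:.4%} of wall clock")
print(f"page granularity:       {page:.1%} of wall clock")
```

Same flag, same update rate: at cache-line granularity it is noise, at page granularity it is three percent of the machine, and a higher-rate counter scales linearly from there. The mitigation is CPU false-sharing hygiene lifted one level up: place CPU-written control state and GPU-read tensor data on separate pages.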
---

## Part VII: What the Hardware Is Actually Doing

### Inside the GPU's copy engine

When the driver decides to migrate a page, it does not simply issue a memcpy. The GPU has dedicated **copy engines (CEs)** that are independent of the compute SMs. These are DMA engines that can move data between address spaces without consuming SM resources. On Hopper, there are multiple CEs, and they can operate concurrently with compute and with each other.

But they share the memory fabric: the L2 cache, the HBM controllers, the NVLink and PCIe interfaces. Migration traffic competes with compute traffic for memory bandwidth. Under heavy migration, you can see HBM bandwidth utilization spike without any corresponding increase in compute throughput, because the bandwidth is being consumed by page copies.

The driver must orchestrate a complex sequence for each migration:

1. Allocate a physical page on the destination, whether device or host.
2. Program the CE with source and destination physical addresses and transfer size.
3. Initiate the DMA transfer.
4. Wait for completion or set up a callback for asynchronous notification.
5. Update the page table to reflect the new physical mapping.
6. Issue TLB shootdowns to invalidate stale translations across all SMs.
7. Signal the faulting SMs to retry their stalled instructions.

Steps 1 through 3 involve CPU work, because the driver runs on the CPU. Steps 4 through 7 involve GPU work and synchronization. The round trip between CPU and GPU for fault handling is one of the fundamental bottlenecks. Every microsecond spent in the driver is a microsecond where the GPU is waiting.

### The fault buffer: a hardware queue with finite depth

The GPU's fault buffer is a hardware FIFO that accumulates fault records. When the buffer is full, additional faults cannot be recorded, and the faulting SMs stall harder, waiting not just for their page but for buffer space to even register their fault.
This creates a dangerous feedback loop that I have seen bring production systems to their knees:

1. A high fault rate fills the buffer.
2. Faulting SMs stall, waiting for buffer space.
3. The driver reads the buffer, but processing each fault takes time because it involves page allocation, migration decisions, and DMA programming.
4. While the driver processes the current batch, more faults arrive from other SMs.
5. If fault processing cannot keep up with fault generation, the GPU is effectively stalled in a livelock between fault generation and fault servicing.

This is the GPU equivalent of a CPU being overwhelmed by interrupts. It is a denial-of-service attack by your own workload against your own fault-handling infrastructure. The system is not broken. It is faithfully executing the policy you gave it. The policy just happens to be pathological under load.

### Page-table walker contention: the second-order bottleneck

The GPU's page-table walker is a shared resource within each SM. When many warps miss in the TLB simultaneously, for example during the first pass over a large tensor after a working-set shift, the page-table walker can become a bottleneck even if no faults occur.

The walker must issue memory reads to traverse the page-table levels. Those reads go through the L2 cache and, on cache miss, to HBM. If many walks are in flight across many SMs, they compete for L2 cache bandwidth and HBM bandwidth with the actual compute workload.

The symptom is subtle: reduced SM utilization and lower-than-expected compute throughput without any obvious memory bandwidth saturation in the standard metrics. The bottleneck is not data bandwidth but **translation bandwidth**. Nsight Compute can reveal this through the TLB hit rate and memory subsystem utilization metrics, but almost nobody checks those counters in AI inference profiling.
---

## Part VIII: Practical Patterns That Break in Production

### "Host offload" without a fault model

If your inference engine includes phrases like:

- "offload KV cache to host"
- "spill activations to host"
- "use Unified Memory to simplify memory management"
- "oversubscribe and let it page"

then you need a fault model. Not a guess. Not a hope. A model.

Because "host offload" is not a single feature. It is a chain of mechanisms, and every link in the chain has a failure mode:

- **Page placement policy**: Which pages go to host, and which stay on device? Is this static or adaptive? What happens when the adaptive policy is wrong?
- **Migration triggers**: Are pages migrated on demand (when a fault occurs) or proactively (when the driver predicts they will be needed)? Demand migration is reactive and adds latency. Proactive migration requires prediction, and prediction can be wrong.
- **Prefetch strategy**: Can you predict which pages will be needed next? For weight tensors, yes: the access pattern is deterministic. For KV cache, it depends on the request schedule, which is not deterministic.
- **TLB invalidation costs**: How many SMs are affected by each migration? If a hot weight page is migrated, the shootdown affects every SM.
- **Interconnect saturation**: What happens when migration traffic competes with other PCIe traffic? Can migration traffic delay GPU command submission?
- **Concurrency interactions**: Does fault handling block other work? If the driver is busy processing faults, can it still service other requests like memory allocation and kernel launch?

NVIDIA's own guidance on Unified Memory performance has emphasized that fault processing includes de-duplication, mapping updates, and DMA transfers, and that fault handling can add significant overhead, especially for streaming access patterns.

The lesson is not "don't use Unified Memory." The lesson is: **if you use it, you are adopting a policy engine.** And policy engines have regimes.
You need to know which regime you are in, what triggers a transition between regimes, and what happens in the regimes you did not test.

### First-touch faults: the hidden cold start inside your hot path

Here is a minimal CUDA example. It looks innocent. It can also manufacture faults if you do not control placement.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(float* x, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.01f + 1.0f;
}

int main() {
    const size_t n = 1ull << 28;  // ~1 GB of floats
    float* x = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));

    // Without prefetch, the first GPU access to each page faults.
    // With 4 KB pages, that is 262,144 faults for 1 GB.
    // At 30 us per fault, that is ~8 seconds of fault handling.

    int dev = 0;
    cudaGetDevice(&dev);

    // Prefetch moves pages proactively. No faults on first access.
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);
    cudaDeviceSynchronize();

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    touch<<<grid, block>>>(x, n);
    cudaDeviceSynchronize();

    cudaFree(x);
}
```

The point is not the API calls. The point is that **first-touch can become first-fault**, and if that happens inside a latency-critical path, you have created a hidden cold-start regime inside what you believed was steady state.

In LLM inference, "steady state" is already a fragile illusion, because continuous batching means new requests with new KV caches, new address regions, and new first-touch surfaces are constantly arriving. Every new request is a potential cold-start event at the virtual memory layer.

### The allocator fragmentation trap

Most CUDA memory allocators, including the caching allocator in PyTorch and the arena allocators in TensorRT-LLM, work by pre-allocating large blocks and suballocating from them. This is efficient when the block is contiguous in both virtual and physical address space.
But under memory pressure, the allocator may return a block that is virtually contiguous but physically fragmented: the underlying pages are scattered across different HBM banks or, worse, some pages are in host memory because the Unified Memory system migrated them. The virtual address looks fine. The `data_ptr()` is contiguous. The physical reality is a minefield.

Physically fragmented allocations cause:

- **Poor HBM bank utilization**: Some banks are hot, others cold, leading to uneven bandwidth utilization and higher effective latency.
- **Unpredictable TLB behavior**: The virtual-to-physical mapping is not stride-friendly, so prefetch-friendly TLB patterns break down.
- **Hidden host residency**: Some pages may be in host memory without the application knowing, and those pages will fault on access.

This is completely invisible to the application. The tensor looks normal. The shapes are right. The dtypes are right. But the memory subsystem is doing extra work behind the scenes, and that work shows up as latency variance that nobody can explain, because nobody is looking at the right layer of the stack.

---

## Part IX: The Observability Gap

### What your dashboard is not showing you

Most inference dashboards track the same five metrics:

- tokens per second
- GPU utilization percentage
- KV cache usage
- memory allocated versus total
- request queue depth

Almost nobody tracks:

- GPU page faults per second, instantaneous and windowed
- migrated bytes per second, HtoD and DtoH separately
- fault bursts correlated with specific prompts, tenants, or context lengths
- CPU time spent in the NVIDIA UVM driver's fault-servicing path
- TLB miss rate and page-table walker utilization
- copy engine utilization
- PCIe bandwidth consumed by migration versus useful data traffic like model loading

Yet the tooling is trying to tell you this is important. NVIDIA Nsight Systems shows these timelines explicitly, including GPU page fault and data migration rows.
Nsight Compute can show TLB hit rates and translation overhead per kernel. The data is there. We just do not put it on the production dashboard because nobody asked for it. Nobody asked for it because nobody knows it matters. And nobody knows it matters because they have never diagnosed a p99 spike to its root cause in the MMU.

If you want a practical posture, start here:

> If I can't graph page faults and migration bandwidth next to p99, I'm blind.

### Building a fault-aware telemetry stack

Here is what I recommend as a minimum viable observability layer for any team serving models at scale:

**NVIDIA DCGM metrics**: Export NVLink TX and RX bandwidth counters to track inter-GPU migration traffic. Export PCIe TX and RX rates to track host migration. These counters are available through DCGM and can be scraped by Prometheus.

**Nsight Systems periodic profiling**: Run 10-second Nsight captures periodically in production. The overhead is low enough for sampling, a single-digit-percent CPU cost. Look for the GPU Page Fault and Data Migration rows. Automate the analysis: flag captures where the fault count exceeds a threshold.

**Custom CUDA event timers**: Instrument your inference pipeline to measure the wall-clock time of each decode step with CUDA events. When a step exceeds two times the median, flag it and correlate with GPU counters from the same time window.

**Unified Memory fault counters**: If using Unified Memory, CUPTI exposes Unified Memory counter activity (GPU page faults, CPU page faults, migrated bytes), and the UVM kernel module exposes statistics through its procfs interface. Export these to your monitoring system.

**CPU profiling correlation**: Use `perf` or `bpftrace` on the host to measure time spent in the NVIDIA driver's fault-handling path. The relevant kernel functions live in the `nvidia-uvm` module, the `uvm_*` symbol family. If you see significant CPU time in these functions, you have a fault problem, not a compute problem.
---

## Part X: Architecture-Level Decisions

### A production playbook that does not pretend faults do not exist

If you are serving large models at scale, you should make three decisions explicitly, and write them down like policy. Not in a design doc that nobody reads. In the runbook. In the SLO definition. In the capacity planning spreadsheet.

### 1) Residency contract

Decide what must always be resident on device. In most production deployments, that includes:

- **Model weights**: The entire parameter set, or the shards for your tensor-parallelism group. These should never be evicted. If they are migrated to host, your inference latency is no longer a function of your compute; it is a function of your PCIe bandwidth.
- **Hot KV cache tiers**: The most recently accessed KV entries for active requests. Define "hot" concretely: the last N tokens, or the entries accessed within the last M milliseconds.
- **Routing tables and safety layers**: Any data structures that are accessed on every request and cannot tolerate latency variance. Classifiers, content filters, tokenizer lookup tables.
- **The page tables themselves**: Yes, GPU page tables consume device memory. At 4 KB granularity for a large address space, this can be hundreds of megabytes. Budget for it.

Everything else is negotiable, but the contract must be explicit. Write it down. Review it when you change model size, context length, batch size, or tenant count.

### 2) Migration contract

Decide how pages are allowed to move between memory tiers. If you allow host spill for any data, you need:

- **A prefetch strategy**: When to proactively move pages to the GPU before they are needed. For predictable access patterns like weight tensors, prefetch is straightforward. For KV cache, it depends on request scheduling, which may not be predictable.
- **An eviction strategy**: Which pages to move out, and when. LRU? LFU? Tenant-aware? The eviction policy determines which requests pay the fault tax.
- **Tenant isolation rules**: One tenant's fault storm must not affect another tenant's latency. This may require per-tenant memory budgets, separate address spaces, or priority-based fault handling.
- **A hard cap on concurrent migrations**: To prevent saturating the PCIe link with migration traffic. Define the cap, measure against it, alert when it is approached.
- **A fallback policy**: What happens when the migration budget is exhausted? Reject new requests? Degrade quality by truncating context? Queue the request? This is a product decision, not a systems decision.

### 3) Observability contract

Decide what you must measure, and commit to measuring it before the first incident forces you to. At minimum:

- GPU page faults per second, both instantaneous and windowed averages
- migrated bytes per second, HtoD and DtoH tracked separately
- p99 decode latency correlated with fault and migration metrics on the same time axis, in the same dashboard
- CPU time attributable to fault servicing in the UVM driver
- PCIe and NVLink bandwidth utilization, distinguishing migration traffic from other traffic

If you do not measure it, you will rediscover it in an incident. And the incident review will take two weeks, because nobody has the tools to diagnose it quickly.

---

## Part XI: The Frontier

### Where the industry is heading

The virtual-memory problem in AI inference is not going away. It is getting worse, because models are getting larger, context windows are getting longer, and the economic pressure to oversubscribe expensive GPU memory is increasing as inference becomes a commodity. The industry response is multi-pronged:

**Hardware solutions**: Grace Hopper's unified memory architecture with C2C at 900 GB/s dramatically changes the migration economics, making page faults cheaper but not free.
CXL (Compute Express Link) promises to extend the memory fabric across the rack, giving GPUs access to terabytes of pooled memory with latencies between local DRAM and remote DRAM. AMD's MI300A integrates CPU and GPU dies on the same package with shared HBM, reducing the migration path to an on-package interconnect. Intel's GPU Max uses a similar approach with unified HBM pools.

**Software solutions**: Research systems like DeepSpeed-Inference and FlexGen have explored explicit memory tiering and prefetching for LLM inference, treating the memory hierarchy as a first-class scheduling concern. vLLM's PagedAttention divides the KV cache into virtual "pages" (not OS pages, but the concept is analogous) to reduce fragmentation and enable more efficient memory management. These systems are starting to treat memory management as an integral part of the inference scheduler, not as a separate concern.

**Operating system evolution**: Linux's HMM framework continues to evolve, with better support for device-managed pages and more efficient migration paths. There is ongoing work to expose GPU memory management events to BPF, which would enable custom fault-handling policies and observability without modifying the driver. This could be transformative for production monitoring.

**Architectural research**: There are active research efforts to push page-migration decisions closer to the device, using near-memory processing or smart interconnects to reduce the CPU's involvement in fault handling. The recognition is that the CPU bottleneck in the fault path is a fundamental scaling limit, and the only way to scale fault handling is to take the CPU out of the critical path.

### The CXL wildcard

CXL 3.0 introduces hardware-managed coherence across a fabric of compute and memory devices.
In theory, a CXL-attached memory expander could provide terabytes of additional memory to a GPU cluster with access latencies of 200 to 400 nanoseconds, comparable to a cross-socket DRAM access and dramatically better than PCIe page migration. If CXL delivers on its promise, the virtual-memory landscape for AI inference changes fundamentally: instead of page faults triggering multi-microsecond migrations over PCIe, the memory fabric provides direct access to a much larger pool at DRAM-like latencies.

The page-fault model becomes less important. But the TLB and translation model becomes *more* important, because you are now managing a much larger physical address space with more diverse latency characteristics. The TLB must cover terabytes, not hundreds of gigabytes, and the page-table structures grow proportionally.

This is still the future, but the architectural direction is clear: the memory hierarchy is getting deeper, the coherence domain is getting wider, and the system that manages it all, the MMU, the page tables, the TLBs, the coherence protocols, is becoming the critical path that determines whether your inference system meets its SLO or does not.

---

## A quote I use in every architecture review

> When latency spikes look like randomness, you're usually watching a control plane you didn't instrument.

In 2026, virtual memory is one of the most common uninstrumented control planes in AI inference. And it is the one most likely to bite you when you scale.

---

## The leadership takeaway: memory virtualization is a product decision

Some teams will read this and conclude, "fine, we will avoid Unified Memory entirely." That is not the lesson. The lesson is: every approach has a residency policy. Even "no Unified Memory" is still a policy, just a manual one. And manual policies are the ones that drift silently, because nobody updates them when the model size doubles, the context length triples, or the tenant count quadruples.
Manual policies are the ones that work perfectly in the benchmark and fail catastrophically in production, because the benchmark does not exercise the edge case where the policy breaks.

The real question is what you are optimizing for:

- **Raw throughput**: Maximize tokens per second, tolerate latency variance, pack the GPU as full as possible.
- **Tail latency stability**: Bound p99, potentially sacrifice throughput, maintain headroom.
- **Multi-tenant safety**: Isolate tenants' memory domains; prevent one tenant's fault cascade from affecting another.
- **Reproducibility and auditability**: Deterministic page placement, no migration-induced variance, identical behavior between runs.
- **Cost efficiency through oversubscription**: More tenants per GPU, higher risk, lower cost per request.

You cannot optimize for all of them equally. The laws of physics and the mechanics of page tables will not allow it. But you can stop being surprised by the MMU. You can stop treating page faults as acts of God. You can build systems that know where their pages are, that measure the cost of moving them, and that make conscious trade-offs between throughput and tail stability.

If you are at the distinguished-architect level or above, that is the job: turning surprises into architecture. Turning invisible mechanisms into explicit contracts. Turning "it works on my machine" into "I can tell you exactly what happens when the working set exceeds the TLB coverage at 3 AM on a Saturday."

---

## A closing thought

We spent a decade building model intuition. We learned to reason about loss curves, attention patterns, tokenizer pathologies, quantization noise, and prompt engineering. We got very good at it. We built trillion-dollar companies on it.

Now we need to build **machine intuition** again. The kind we used to have when we cared about TLB coverage and cache-line alignment and interrupt coalescing and NUMA locality.
The kind that tells you, without looking at a dashboard, that a system with 95% GPU utilization and 50 page faults per second is a system that is about to fall off a cliff.

Because the next generation of AI incidents will not be caused by "bad prompts" or "wrong hyperparameters" or "the model hallucinated." They will be caused by systems that accidentally made page faults part of the critical path. Systems where a single long-context request triggered a migration cascade that stalled four GPUs for 200 milliseconds. Systems where the KV cache crossed a residency boundary that nobody knew existed, because nobody measured it.

And the only teams that will not be shocked are the ones that treated virtual memory as a first-class design surface from day one. The ones that drew the page tables on the whiteboard next to the model architecture. The ones that measured TLB hit rates next to tokens per second.

The MMU does not care about your model architecture. It does not care about your attention mechanism or your quantization scheme. It cares about one thing: is this page here, or is it somewhere else?

Get that question wrong, and the hardware will answer it for you. Slowly.