Inference

2 articles tagged with “Inference”

When Your LLM Trips the MMU: Page Faults, TLB Shootdowns, and the Hidden Virtual-Memory Tax of AI Inference

A distinguished-architect deep dive into GPU virtual-memory internals: MMU fault pipelines, TLB shootdown mechanics, page-table walks, Unified Memory/HMM coherence, ATS, and why page migration turns your p99 latency into a hardware problem nobody on the team budgeted for.

Hazem Ali · 45 min read
Kernel Dynamics: The Real Bottleneck of AI

Why LLM inference speed is dominated by kernel execution, memory traffic, and runtime scheduling, not raw FLOPS. A deep technical guide to prefill vs. decode, the Roofline model, memory walls, FlashAttention, KV-cache paging, warp mechanics, and GPU pipeline design.

Hazem Ali · 35 min read