There is a comfortable lie we tell ourselves in AI development. We write model.generate(), we watch tokens drip onto the screen, and we imagine a clean mathematical ritual happening somewhere behind the curtain. Matrices multiply. Probability flows. The rest is plumbing.
But the map is not the territory.
When you peel back the layers of PyTorch, strip away the CUDA abstractions, and look at what actually happens on silicon, you do not find clean mathematics. You find traffic. You find scheduling. You find a machine fighting to move bytes through a finite set of pathways.
That fight is the real bottleneck.
Most people think LLM performance is about raw compute, FLOPS. That belief survives because it sounds scientific. It comes with numbers that fit on slides. It makes the story feel linear.
Bigger chip, more FLOPS, faster model.
In real systems, speed lives and dies by data movement. By how often you make the hardware go to global memory. By how well you preserve locality. By whether your kernels keep the machine fed, or leave it starving while it waits for bytes.
This is not a metaphor. It is the literal shape of inference at scale.
The first split most people never make
I keep seeing the same mistake, even from strong teams. They talk about inference like it is one workload with one bottleneck, then they spend weeks tuning the wrong thing.
Inference is not one thing. It is two different workloads that share a name.
Prefill is the part everyone benchmarks because it looks impressive. You have a prompt. You push it through the model. You build the initial state and populate the first KV entries.
Prefill can be heavy, wide, parallel. The shapes are often large enough that the GPU gets to behave like the marketing slides promised.
Big GEMMs, high occupancy, tensor cores busy, decent arithmetic intensity. With the right batching and the right kernels, you can drive real utilization. It is still constrained by memory traffic, but it at least gives you room to trade math scheduling for throughput.
Decode is the part that humbles you.
You generate one token at a time. Each step depends on the previous token, so the system becomes sensitive to latency and to everything your profiler used to hide under averages. Work per step shrinks. Kernel launch overhead starts to matter.
Cache locality gets messy. KV reads turn into a constant tax. Memory pressure rises as sequences grow. Scheduling decisions move from being a nice-to-have to being the difference between stable p95 and a system that spikes under load.
This is where the FLOPS story starts to crack. Not because the GPU got weaker, but because the workload stopped being friendly.
Here's the line I come back to when I design serving stacks:
Prefill sells your benchmark. Decode pays your production bill. — Hazem Ali
If you talk about LLM speed without separating prefill from decode, you end up mixing two different performance regimes into one misleading number. You will say true things for prefill that become wrong in decode. You will optimize matmul throughput while your real bottleneck is KV traffic, launch overhead, and scheduler behavior. And you will think you have a fast system until real traffic arrives.
This is also why I keep insisting, like I did in my Microsoft article, that production behavior is a runtime property.
You can run the same model with the same weights and still get different bottlenecks depending on how the execution path is shaped by batching, cache state, and the serving pipeline. Prefill and decode are the cleanest example of that split, because it is not theoretical.
It is deterministic workload physics.
The only accurate question
The only accurate question is not how many FLOPS your GPU has.
It is which resource becomes your ceiling first.
Sometimes you are compute bound. You have enough locality and enough arithmetic intensity that math throughput is the limit.
Sometimes you are bandwidth bound. You stream so many bytes that global memory becomes the limiter.
Sometimes you are latency or overhead bound. Your work units are small. Your accesses are scattered. Your pipeline is dominated by launch costs, synchronization, pointer chasing, or queueing.
Every real inference stack lives across all three. The trick is identifying which regime you are in at each stage and shaping the kernels and the scheduler to move the bottleneck to something you can afford.
That is what kernel dynamics really means.
Why the Roofline model is not optional
Let me put that very clearly. If you cannot put an upper bound on performance, you are not doing optimization. You are doing trial and error.
The Roofline model gives you a simple bound that is hard to argue with. Achievable throughput is limited by the smaller of two ceilings.
Peak compute is one ceiling.
Peak memory bandwidth is the other. Operational intensity connects them.
Operational intensity is just FLOPs per byte moved.
The byte part matters. You have to define what memory level you mean. Most of the time, the first question is DRAM or HBM traffic, because that is the expensive trip that sets the pace for many kernels.
Once you have an intensity estimate, the bound is straightforward.
Achievable FLOPs per second is at most the minimum of peak compute and sustained bandwidth times operational intensity.
That single sentence changes how you think. It forces a concrete accounting.
How many bytes does this step move to and from HBM. How many FLOPs does it do on those bytes. What reuse exists before the data falls back to HBM.
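That accounting fits in a two-line bound. A minimal sketch, with hypothetical peak numbers (the 100 TFLOP/s and 2 TB/s figures are illustrative, not any specific part):

```python
def roofline_bound(peak_flops, peak_bw, intensity):
    # Achievable FLOP/s is capped by the lower of the two roofs:
    # the compute ceiling, or bandwidth times operational intensity.
    return min(peak_flops, peak_bw * intensity)

# Hypothetical accelerator: 100 TFLOP/s peak, 2 TB/s sustained HBM bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 2e12

# Ridge point: the intensity where the two ceilings meet.
RIDGE = PEAK_FLOPS / PEAK_BW  # 50 FLOPs per byte

# fp16 matrix-vector, decode-style: roughly 2 FLOPs per 2-byte weight,
# so intensity near 1. Deep inside the bandwidth roof.
print(roofline_bound(PEAK_FLOPS, PEAK_BW, 1.0))    # capped at 2e12 FLOP/s

# Large batched GEMM with heavy on-chip reuse: intensity past the ridge.
print(roofline_bound(PEAK_FLOPS, PEAK_BW, 300.0))  # capped at 1e14 FLOP/s
```

The two calls are the whole story of prefill versus decode in miniature: same chip, two regimes, two different ceilings.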
If intensity is low, the diagonal bandwidth roof becomes your ceiling. You can rewrite kernels all day and still be limited by bytes per second. Your only real levers are reducing memory traffic, increasing reuse, improving locality, or changing precision and layout so the same work moves fewer bytes.
If intensity is high, the horizontal compute roof becomes your ceiling. Now the levers shift. You care about tensor core utilization, instruction mix, pipeline stalls, register pressure, occupancy, and whether you are issuing the right math instructions at the right rate.
There is also a ridge point, the intensity where the two ceilings meet. Below it, you are bandwidth limited. Above it, you can become compute limited. That ridge is a useful sanity check, not a badge.
One more honest caveat. Roofline is a throughput bound. It does not model everything that hurts you in decode. Kernel launch overhead, synchronization, queueing, fragmentation, and pointer chasing can dominate even when the Roofline looks generous. So use Roofline to identify which ceiling you are hitting, then verify with profiling and real traffic counters.
What this really means is simple. If you cannot estimate bytes moved, you cannot predict performance.
You can only measure after the fact and call it insight.
The memory hierarchy is the real API
We like to talk about models. The GPU talks about hierarchy.
Global memory is huge and fast in bandwidth terms, but it is slow in latency terms compared to on chip storage. Caches exist, but they are not magic. They reward locality. They punish chaos.
Closer to the cores you have smaller and faster storage. L2 can help when there is reuse and when access patterns align with what it can keep hot. L1 and shared memory are even faster, but they are tiny and they demand careful layout. Registers are the fastest of all, and also the most fragile resource in the system. Spill registers and you quietly create more global memory traffic.
This hierarchy is not an implementation detail. It is the cost model.
If an intermediate result stays in registers or shared memory, it is cheap.
If you write it out to global memory and then read it back, you are paying the worst price twice.
That is the traffic jam hiding behind your clean API.
The memory wall, defined properly
People say the GPU is waiting on memory. That sentence is too vague to be actionable. In practice it hides three different failure modes, and they behave differently under load. If you do not name the mode, you will optimize the wrong thing and still feel stuck.
1. Bandwidth saturation: you are moving too many bytes, too efficiently
This is the cleanest kind of pain. You are streaming data in a way the hardware actually likes. Requests coalesce. Access patterns are regular. The memory controllers stay busy. Your profiler will show a high fraction of peak HBM utilization.
The bad news is you cannot outsmart physics here. If the kernel is already well structured and the bus is busy, you are at the wall. The only way forward is to move fewer bytes per unit of useful work or reuse bytes more before they fall back to HBM.
What this really means is you stop thinking like an algorithm designer and start thinking like a logistics engineer. Every extra read or write is a truck you sent across the same highway.
Typical fixes are not mysterious. They are unpleasantly concrete. Fuse kernels so intermediates do not round-trip to HBM. Increase reuse by tiling so the same bytes do more work while they are still on chip. Reduce bytes moved by changing precision, packing formats, or avoiding materialization of transient tensors. If you are saturating bandwidth, your problem is not clever math. It is traffic volume.
2. Latency and irregular access: the bus is not full, but you are still slow
This is the one that confuses teams because peak bandwidth numbers look great on paper, yet the kernel crawls. You are not limited by bandwidth. You are limited by how painful it is to fetch the next piece of data.
The signature here is scattered access. Gather patterns. Pointer chasing. Indirection tables. Page maps. Anything that turns memory into a series of small, poorly coalesced requests. The memory system is capable of moving a lot of data per second, but it cannot do it efficiently when every request is tiny, far apart, and unpredictable. Caches often do not save you because the working set is either too large, too random, or both.
This mode shows up in real serving when you introduce structures like paged KV blocks, variable-length batching, or any scheme that trades contiguity for utilization. The trade can still be worth it. But you should be honest about the cost: you are paying in latency per access, not in raw bytes per second.
Fixes here are about making access more predictable. Improve locality with layout and blocking. Make the mapping structure cache-friendly. Reduce indirection in the hot path. Batch in a way that aligns requests and reduces scatter. If you are latency bound, throwing bandwidth at the problem does not help. You need to change the shape of the access.
3. Locality collapse: everything looks like global because you lost reuse
This one feels like the system is falling apart. You are not necessarily saturating bandwidth, and you are not necessarily doing random gathers. You are just not getting any benefit from cache because your working set is too big, reuse is too low, or both.
It happens when the model state and KV footprint grow past what your caches can keep hot, or when your batching policy mixes unrelated sequences so aggressively that locality evaporates. The result is cache thrash. Lines get evicted before they can be reused. L2 becomes a pass-through. The hardware spends time moving data that never earns its keep.
The fix is not a single trick. It is usually a working-set problem. Shrink the hot set with fusion and tiling. Keep the right data resident longer by structuring the schedule so reuse actually happens. Stabilize shapes and batching so the system sees repeated patterns instead of chaos.
The rule that makes this section useful
If you claim memory is the bottleneck, you need to say which one.
Bandwidth saturation means you are moving too many bytes, and you win by reducing bytes or increasing reuse.
Latency and irregular access means you are paying per fetch, and you win by improving locality and reducing indirection in the hot path.
Locality collapse means reuse is not landing, and you win by shrinking the working set and stabilizing access patterns.
Same symptom. Different diseases. Different cures.
Why FLOPS often misleads
A modern accelerator can perform an absurd amount of math. If you can keep it fed.
In inference, especially in decode and attention heavy paths, a large fraction of time is spent moving weights, reading KV cache, writing KV cache, staging tiles, reshaping layouts, and managing the hidden costs of abstraction.
The system can be slow while the math units are bored.
That is why speed in real stacks often correlates more with memory bandwidth and locality than with headline FLOPS.
Not always. Prefill can be compute heavy under favorable shapes. Some fused kernels can become compute limited. Some workloads hit synchronization or launch overhead first. The point is that FLOPS alone is not a bound. It is a capability that may or may not be reachable.
The fusion imperative
Most framework graphs were not designed to respect the memory hierarchy. They were designed to be readable.
A typical layer becomes a sequence. Linear. Activation. Normalization. Maybe a residual path. Each stage becomes a kernel. Each kernel reads inputs and writes outputs. In the naive world, intermediates bounce to global memory between kernels.
That bounce is the expensive part.
Kernel fusion is the act of refusing to pay it.
Instead of launching separate kernels, you combine operations so intermediates can remain on chip while you finish the sequence. Registers, shared memory, L1. The exact storage depends on the kernel and the shape, but the principle is stable.
Keep data close. Avoid global round trips.
This is where the story becomes honest. Fusion is not free. It pushes complexity into the kernel. It increases register usage. It can reduce occupancy. It can trigger register spills. Spills go to local memory, and local memory is backed by global memory. A poorly fused kernel can be slower than the unfused version.
So fusion is not a mantra. It is byte traffic accounting under constraints.
Reduce global reads and writes without triggering spills and without collapsing parallelism.
That is the real work.
Attention is an IO problem in disguise
Attention looks like a clean equation. In practice it is an IO machine.
The trap is not the dot product. The trap is the intermediate. Naive attention wants you to build an attention score matrix of shape N x N, then softmax it, then multiply by V.
That matrix is transient. You do not actually need to store it.
But if you materialize it, you pay for it twice: write to HBM, read it back. As N grows, that global memory traffic becomes a dominant cost, even when the GPU has plenty of compute headroom.
This is why attention kernels became a battleground. Not because someone discovered a new softmax. Because the winning implementations learned to avoid writing the wrong thing to global memory.
In attention, the fastest path is the one that refuses to remember what it never needed to store.
The naive version and the IO mistake
This is the readable form that creates the N x N intermediate:
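A pure-Python sketch, with illustrative names: it computes exact attention for N query rows, but only after building the full N x N score matrix S.

```python
import math

def naive_attention(Q, K, V):
    # Q, K, V: N rows of dimension d, as plain lists.
    # The readable form: materialize all scores, softmax, then combine V.
    N, d = len(Q), len(Q[0])
    scale = 1.0 / math.sqrt(d)
    S = [[scale * sum(q * k for q, k in zip(Q[i], K[j])) for j in range(N)]
         for i in range(N)]                       # the N x N intermediate
    O = []
    for i in range(N):
        m = max(S[i])                             # max-shifted softmax
        e = [math.exp(s - m) for s in S[i]]
        z = sum(e)
        O.append([sum(e[j] / z * V[j][t] for j in range(N)) for t in range(d)])
    return O
```

On a GPU, S is exactly the tensor that round-trips through HBM.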
The math is fine. The problem is the implied memory behavior as N grows.
The IO aware pattern: tile and stream with online softmax
FlashAttention's core idea is to compute exact attention while reducing HBM reads and writes, by tiling and avoiding storage of large intermediates. The mechanism relies on an online softmax style update so you can accumulate output without materializing the full N x N matrix.
Here is a minimal sketch of the online softmax invariant.
It is meant to show what stays resident and what never gets written:
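A plain-Python sketch with illustrative names. The resident state per query row is just a running max m, a running denominator z, and the partial output acc; K and V stream through in tiles, and nothing of size N x N ever exists. A full-softmax reference is included only to show the two paths agree.

```python
import math

def streaming_attention_row(q, K, V, tile=2):
    # One query row, exact attention, no N x N intermediate.
    d = len(q)
    scale = 1.0 / math.sqrt(d)
    m, z, acc = float("-inf"), 0.0, [0.0] * d    # the only resident state
    for start in range(0, len(K), tile):
        for j in range(start, min(start + tile, len(K))):
            s = scale * sum(a * b for a, b in zip(q, K[j]))
            m_new = max(m, s)
            corr = math.exp(m - m_new)           # rescale factor for old state
            z = z * corr + math.exp(s - m_new)
            acc = [a * corr + math.exp(s - m_new) * v
                   for a, v in zip(acc, V[j])]
            m = m_new
    return [a / z for a in acc]                  # the output is written once

def full_attention_row(q, K, V):
    # Reference path that materializes the whole score row.
    d = len(q)
    scale = 1.0 / math.sqrt(d)
    s = [scale * sum(a * b for a, b in zip(q, K[j])) for j in range(len(K))]
    mx = max(s)
    e = [math.exp(x - mx) for x in s]
    z = sum(e)
    return [sum(e[j] / z * V[j][t] for j in range(len(K))) for t in range(d)]
```

The correction factor corr is the whole trick: when a new maximum arrives, previously accumulated work is rescaled instead of recomputed.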
What changes is not the result. It is the data movement.
You stream K and V in tiles, keep only running statistics plus the partial output, and you write the final output once. That is why it scales better in practice.
One practical note that makes this real in PyTorch
Modern PyTorch exposes scaled dot product attention and attempts to pick an optimized backend depending on shapes and hardware.
That is a signal worth mentioning because it reflects the reality: the implementation path matters.
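A minimal call, with illustrative shapes. The same line can dispatch to a flash-style kernel, a memory-efficient kernel, or the plain math fallback depending on dtype, shape, and hardware:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) -- illustrative shapes.
q = torch.randn(2, 4, 128, 64)
k = torch.randn(2, 4, 128, 64)
v = torch.randn(2, 4, 128, 64)

# One API, several backends. The dispatch decision is exactly the
# implementation path the text is talking about.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

Same math, same result, different byte traffic underneath.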
FlashAttention as a pattern, not a brand
FlashAttention is best understood as a design pattern.
Tile the computation. Work on blocks that fit into fast on chip storage. Compute in a way that preserves numerical correctness without materializing the full attention matrix. Keep transient values transient.
The core idea is IO awareness. You compute what you need, when you need it, and you avoid writing massive intermediates to global memory.
This is the difference between a kernel that respects the memory hierarchy and one that fights it.
It is also why improvements in attention kernels can unlock long context windows. The point is not that the GPU suddenly got smarter. The point is that the system stopped pushing an impossible volume of transient data through global memory.
Serving turns KV cache into an allocator problem
Training hides many of the nastiest serving realities. In serving, the world is variable.
Requests arrive with different prompt lengths. They generate different output lengths. They finish at different times. Batches change constantly. The load is jagged.
Meanwhile decode needs history. That history is the KV cache.
The KV cache is not optional. It is the memory footprint of the conversation.
And it grows.
In a naive design, you allocate a contiguous region per request. But you rarely know the final length in advance. So you over allocate or you reallocate, and either way you create waste. Variable lengths produce holes. Holes produce fragmentation. Fragmentation produces lost capacity. Lost capacity reduces concurrency. Reduced concurrency reduces throughput.
That is how a memory allocator becomes a throughput limiter.
Paging is virtual memory for the GPU, in spirit
The fix is an old idea from operating systems.
Stop requiring contiguity.
Break the KV cache into blocks. Store blocks in device memory wherever space exists. Maintain a mapping from logical positions to physical blocks.
At the kernel level, this introduces indirection. Instead of a simple pointer increment, you consult the mapping to find the next block.
Indirection has overhead. But the gain is that you can pack memory much tighter across a dynamic set of requests. You reduce fragmentation. You increase usable capacity. You increase batch efficiency.
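The bookkeeping is small enough to sketch. Names and sizes here are illustrative, not any particular serving engine:

```python
BLOCK = 16  # tokens per KV block (illustrative)

class PagedKV:
    # Logical position -> physical block, via a per-sequence block table.
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # physical block free list
        self.tables = {}                      # seq_id -> list of physical blocks

    def append_token(self, seq_id, pos):
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK == 0:                  # crossed a block boundary
            table.append(self.free.pop())     # grab any free block: no contiguity
        return self.locate(seq_id, pos)

    def locate(self, seq_id, pos):
        # The extra indirection the kernel pays on every KV read.
        return self.tables[seq_id][pos // BLOCK], pos % BLOCK

    def release(self, seq_id):
        self.free.extend(self.tables.pop(seq_id))  # blocks return to the pool
```

Every locate call is the translation step that did not exist in the contiguous design. That is the price, and the packed free list is the payoff.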
This is why paging wins in many serving regimes.
Not because indirection is fast. Because wasted VRAM is slower.
The price of indirection
Indirection is not free. It can change the bottleneck.
This is the part even experienced engineers underestimate: indirection is not "one extra lookup". It can flip the whole kernel from bandwidth-bound streaming into latency-bound dependency chains.
With paged KV, you store KV blocks non-contiguously and use a block table to map logical positions to physical blocks. That mapping is the win for fragmentation, but it inserts an extra level of address translation into the hot path.
Now the GPU side consequence: coalescing is warp-based. If lanes in a warp land in many different memory segments, you do not get one clean transaction. You get multiple segment fetches, and the hardware may have to replay memory instructions. On recent NVIDIA architectures the effective DRAM transaction quantum is a 32-byte sector, and scattered access patterns explode the number of sectors a warp touches.
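That accounting is easy to model. A small sketch, assuming 32-byte sectors and 32 lanes per warp:

```python
SEGMENT = 32  # bytes per DRAM sector (recent NVIDIA parts)

def segments_touched(byte_addresses):
    # How many distinct 32-byte segments one warp's loads land in.
    return len({addr // SEGMENT for addr in byte_addresses})

# Contiguous fp16 loads: 32 lanes * 2 bytes = 64 bytes = 2 segments.
contiguous = [lane * 2 for lane in range(32)]

# Paged gather: each lane lands in a block far from its neighbors.
scattered = [lane * 4096 for lane in range(32)]

print(segments_touched(contiguous))  # 2
print(segments_touched(scattered))   # 32
```

Same 64 bytes of useful data either way. The scattered warp drags in 16 times as many sectors to get them.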
Here is the rare failure mode that kills tail latency. If the block table itself is not hot in L2, you pay an extra device-memory round trip just to learn which device-memory address to fetch next. That is pointer chasing in front of pointer chasing.
L2 can act as a coalescing buffer for some write behavior, but it cannot magically turn a scattered gather into a stream if locality is gone.
The scheduler is where it becomes a systems problem. Batching policy changes which pages are touched together. That changes locality and cache hit rates. That changes whether the kernel is bandwidth-limited or latency-limited.
NVIDIA's own guidance frames this clearly: increasing bandwidth helps bandwidth-bound kernels, but it may not help latency-bound kernels unless you improve cache hits or increase parallelism to hide latency.
Paging does not slow you down by doing more work. It slows you down by making the work harder for the memory system to predict. — Hazem Ali
The deepest layer: warps are the unit of execution
At the metal, the GPU does not execute threads one by one. It executes warps, and a warp is the real scheduling unit you are negotiating with. On NVIDIA GPUs, threads inside a block are grouped into warps of 32 threads, executing in a SIMT model.
That single fact reshapes everything.
A warp has one instruction stream at a time.
When lanes take different branches, the warp does not magically run both paths in parallel. It serializes them under different active masks, then reconverges. You still have 32 lanes, but only some are doing useful work at any instant.
That is why high performance kernels feel obsessed with two things:
- Keep warps converged
- When you must communicate, do it with the cheapest on-chip path that matches the scope
Shared memory is fast until the banks fight you
When lanes in a warp share data, the naive move is shared memory plus a barrier. That works, but it has sharp edges.
Shared memory is banked. When multiple lanes hit the same bank in a way that conflicts, accesses serialize. Performance collapses and it looks like the GPU "randomly" slowed down.
You can see the classic pattern in reductions: store to shared memory, sync, then load again. It is correct, but it pays for storage, a barrier, and potential bank conflicts.
Shuffles are register-level warp collectives
For warp-local exchange, you can often skip shared memory entirely and use warp shuffle intrinsics. They exchange values between lanes inside the same warp.
No shared memory staging. No block-wide barrier. Register to register.
This is why reductions can be made brutally fast. You keep the whole reduction inside the warp, inside registers, and you avoid the shared memory bank hazards.
A minimal warp reduction looks like this:
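Since shuffles are hardware intrinsics, here is a Python model of the data movement, assuming 32 lanes. On device, each round of the loop is a single __shfl_down_sync and the values never leave registers:

```python
def warp_reduce_sum(lanes):
    # Model of a 32-lane shuffle reduction. Each round, lane i adds the
    # value held by lane i + offset, then the offset halves: a log2 tree.
    vals = list(lanes)
    assert len(vals) == 32
    offset = 16
    while offset > 0:
        # All lanes exchange simultaneously, so read from the old values.
        vals = [vals[i] + (vals[i + offset] if i + offset < 32 else 0)
                for i in range(32)]
        offset //= 2
    return vals[0]  # lane 0 holds the warp's total after 5 rounds
```

Five rounds, no shared memory, no barrier. That is the entire reduction.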
Two low-level details matter here more than people admit:
- The _sync shuffle intrinsics, __shfl_down_sync and friends, take a mask that defines participating lanes. All non-exited lanes in that mask must execute the same intrinsic with the same mask, or results are undefined.
- Shuffles are warp-local. They do not replace shared memory when you need reuse across warps, cross-warp staging, or larger cooperative tiles. They complement it.
Communication patterns are an algorithm
At speed, what you compute matters less than how you move values between lanes and where you let them land.
- If the exchange is purely intra-warp and fits shuffle semantics, shuffle is usually the cleanest path.
- If you need cross-warp reuse or structured tiling, shared memory is still the right tool, but now you must design for bank behavior.
What this really means is your kernel is not just math. It is a choreography between: warp convergence, lane masks, on-chip communication, and the memory layout that decides whether shared memory is a freeway or a traffic accident.
Tensor cores are fast math with strict rules
Tensor cores accelerate matrix multiply accumulate on small tiles. They can deliver enormous throughput when fed correctly.
The important part is not the API. It is the contract.
Tile shapes matter. Layout matters. Alignment matters. Strides matter. Staging matters.
If your data is not in the form the tensor core path expects, performance drops. Sometimes you fall back to a slower path. Sometimes you stay on tensor cores but lose efficiency because your pipeline stalls.
This is why high performance kernels look less like math and more like choreography.
You are not writing a loop. You are feeding a machine.
Shared memory is fast until you anger the banks
Shared memory is divided into banks. When threads in a warp access addresses that map to the same bank, accesses can serialize. That is a bank conflict. Bank conflicts are one of the most common reasons a kernel that looks correct becomes slow.
The fix is usually layout engineering.
Padding can help. Reindexing can help. Changing how data is stored and how threads traverse it can help. Sometimes the fix is a deliberate permutation. That is where swizzling enters.
Swizzling is controlled chaos
Swizzling is what you do when the math is fine but the memory system is angry.
Shared memory is split into banks, and a warp hits those banks every cycle. If multiple lanes land on the same bank, the hardware has to serialize what should have been parallel. That is the real problem. Bank conflicts are not an edge case. They are a performance cliff.
Swizzling is a family of address remaps that deliberately decorrelate lane ID from bank ID. Adjacent threads stop hammering adjacent addresses. The mapping looks like noise in a trace, because it breaks the human expectation of linear layout. But the memory hardware does not care about your aesthetic. It cares about whether 32 lanes map to 32 banks in the same cycle.
XOR swizzling is the simplest way to do this. You take a part of the address that represents the bank selection bits and you perturb it with something like the row index, warp index, or a tile coordinate.
The goal is not randomness. The goal is to spread accesses so that lanes that are logically adjacent become physically bank-separated.
Here is a minimal example for shared memory tiles. The exact shift depends on how many bank-select bits matter for your element size, but the structure is always the same:
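A small model, assuming 32 four-byte banks. The worst case is a column walk through a 32 x 32 float tile, where every lane hits bank zero; XOR-ing the column bits with the row spreads the accesses:

```python
BANKS = 32  # four-byte-wide shared memory banks

def bank(word_index):
    return word_index % BANKS

def naive_index(row, col, width=32):
    return row * width + col

def swizzled_index(row, col, width=32):
    # XOR the bank-select bits with the row: logically adjacent rows
    # land in different banks for the same column.
    return row * width + (col ^ (row % BANKS))

# Column walk: 32 lanes each read column 0 of a different row.
naive_banks    = {bank(naive_index(r, 0)) for r in range(32)}
swizzled_banks = {bank(swizzled_index(r, 0)) for r in range(32)}

print(len(naive_banks))     # 1  -> 32-way conflict, fully serialized
print(len(swizzled_banks))  # 32 -> conflict free
```

The data never moved. Only the mapping changed, and a 32-way serialization became a single parallel access.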
What makes this deep is that you are not just changing an index. You are shaping the conflict pattern at the hardware level. You are turning a worst-case stride into a conflict-free permutation. The benefit is often invisible in high-level profiling until you look at shared-memory replay and serialization metrics, then it suddenly explains everything.
So yes, it looks unintuitive. It is supposed to.
The fastest kernel is often the one that stops resembling the algebra and starts resembling the memory system.
The kernel is a pipeline, not a function
A high performance kernel is rarely "load, compute, store."
That mental model is already too late. The real kernel is a pipeline because the GPU is a latency-hiding machine with multiple subsystems that can run in parallel if you give them independent work.
Here's the physical reality.
Global memory latency is huge compared to arithmetic. If you wait for every load before you compute, the SM sits idle. So the kernel becomes a schedule that keeps three things moving at once: global to shared transfers, shared to registers staging, and math on tensor cores or FP units.
The simplest correct shape is double buffering. While you compute on buffer 0, you are already filling buffer 1. Then you swap. That producer consumer pattern is not optional. NVIDIA documents it explicitly for overlapping asynchronous global to shared copies with compute, using double buffering in shared memory.
The hardware lever: asynchronous copies and barriers
On modern NVIDIA GPUs, you can issue asynchronous copies from global memory into shared memory and only wait when the data is actually needed. The CUDA programming guide describes cuda::memcpy_async and pipeline objects for sequencing copies into stages.
At the PTX level, this is backed by cp.async and, in newer forms, asynchronous barriers like mbarrier that track completion. The PTX ISA documents ordering semantics for barrier arrive operations tied to async copies.
That changes how you write kernels. You do not block on loads. You launch them early, do useful work, then wait later when you must.
Here is the core structure. The point is not the exact API shape. The point is the schedule: prologue to fill, steady state to overlap, epilogue to drain.
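A sketch of that schedule in Python, with illustrative names. Python executes sequentially, so this only shows the ordering, not the concurrency; in a real kernel the load would be a cuda::memcpy_async issued early, and the wait would happen just before compute touches the buffer:

```python
def pipelined(tiles, load, compute):
    # Double-buffered schedule: issue the load for tile i+1 before
    # computing on tile i, alternating between two buffers.
    results = []
    bufs = [None, None]
    bufs[0] = load(tiles[0])                  # prologue: fill stage 0
    for i in range(len(tiles)):
        nxt = (i + 1) % 2
        if i + 1 < len(tiles):
            bufs[nxt] = load(tiles[i + 1])    # issue the next transfer early
        results.append(compute(bufs[i % 2]))  # math on the buffer that is ready
    return results                            # last iteration is the epilogue
```

The shape is what matters: at every steady-state step, one buffer is being filled while the other is being consumed.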
The deep trap is that pipeline depth is not free. Every additional stage consumes shared memory and registers.
More stages can increase overlap, but can also reduce occupancy. Reduced occupancy can reduce your ability to hide latency, and then your pipeline collapses into stop and go anyway. NVIDIA's pipeline model is explicitly about sequencing async copies into stages, but you still have to balance resources.
Alignment is not a detail, it is part of the contract
Async copy paths have alignment constraints. For example, cp.async.bulk has documented requirements such as 16-byte alignment and size being a multiple of 16. If you violate that, you lose the fast path or pay extra fixups.
So when you say "prefetch," what you really mean is "issue transfers that the hardware can service efficiently." Layout and alignment are pipeline design, not polish.
Hopper pushes the idea further with dedicated copy engines
On Hopper, NVIDIA introduced the Tensor Memory Accelerator, a hardware unit for bulk asynchronous transfers between global memory and shared memory, paired with asynchronous transaction barriers. This is explicitly described in NVIDIA's Hopper architecture materials.
The point is not the marketing name. The point is that the memory movement path can run more independently from compute, which makes deep pipelining even more valuable, and makes synchronization strategy more important.
Warp specialization is the extreme form of the same idea
If you want to go even lower level, you sometimes split a thread block into producer and consumer warps. One warp's job is to keep the pipeline fed. The others compute. NVIDIA shows this pattern in the async copy section as a producer consumer specialization example, and modern literature describes it as manual deep software pipelines.
This is why the kernel is a schedule. You are not just writing math. You are orchestrating independent units so the SM is never waiting on the next byte.
When the pipeline is wrong, the symptoms are predictable
When the pipeline is done well, you see fewer stalls, higher sustained bandwidth, and high tensor core utilization because data arrives on time.
When it is done poorly, you see bubbles. Waits that line up with loads. Barriers that are too wide. Stage buffers that thrash shared memory. Occupancy collapse from register pressure. The kernel becomes stop and go traffic because the schedule has no slack.
That is the deepest truth here. A fast kernel is not a function. It is a conveyor belt, and your job is to keep every station busy.
Decode exposes launch overhead
Decode is where your illusion of efficiency dies quietly.
In prefill, kernels are fat enough that launch overhead is background noise. In decode, the work per step shrinks. You are doing a token, then another token, and each token is a small parade of kernels. Once kernels get short, the fixed costs stop being amortized. You start paying for the plumbing. CPU side launch work. Driver submission. Synchronization points you forgot you added. Tiny gaps between kernels that look harmless until you multiply them by thousands of tokens.
NVIDIA's own Nsight Systems guidance makes this visible: those inter kernel gaps are real overhead, not imagination.
This is why CUDA Graphs matter in decode. Not as a gimmick, but as a way to stop paying the same launch tax over and over.
The idea is simple and very physical. You capture a stable sequence of GPU operations once, then you replay it with much lower CPU involvement. Instead of issuing kernel launches one by one, you launch a prebuilt graph of nodes and dependencies.
NVIDIA describes CUDA Graphs exactly in those terms: bundling many operations into one launchable unit to reduce launch overhead. PyTorch's graph support is built on the same principle.
But here is the part that is actually principle level. Graphs do not make kernels faster. They make the schedule cheaper.
If your per token path is stable, replay is a win.
If your per token path is unstable, you pay for instability in multiple ways. Dynamic shapes, changing batch composition, conditional control flow, allocator activity, all of this can make capture difficult or force you to maintain multiple graphs. NVIDIA's programming guide is explicit that graphs have rules and limitations around capture and updating instantiated graphs. vLLM ends up capturing different cases because the fast path depends on the regime you are in, not just the model.
So the real engineering problem becomes this: make decode look like a replayable loop without lying to yourself about variability. That often means constraining the serving layer, not just tuning kernels. Stable shapes. Stable launch topology. A controlled set of batch sizes. Warmup runs that capture the steady state. vLLM's design docs are blunt about doing dummy runs to capture graphs for the right execution modes.
Here is what graph capture looks like in PyTorch. This is the minimal pattern, and it explains why a stable decode step benefits.
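The sketch below follows the pattern documented for `torch.cuda.CUDAGraph`: warm up on a side stream, capture once into fixed buffers, then replay with new data copied into those same addresses. `decode_step`, `static_input`, and `static_out` are illustrative names standing in for a real per-token kernel sequence, and the whole thing assumes a CUDA-capable GPU.

```python
import torch

def decode_step(x):
    # Stand-in for one token's worth of kernels in a real decode path.
    return torch.relu(x @ x.t())

if torch.cuda.is_available():
    # Capture requires fixed tensor addresses: inputs are updated by
    # copying into the same buffers, never by reallocating.
    static_input = torch.randn(64, 64, device="cuda")

    # Warmup on a side stream so capture sees a steady-state allocator.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            static_out = decode_step(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture the kernel sequence once.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = decode_step(static_input)

    # Replay: copy fresh data in, launch the whole graph as one unit.
    static_input.copy_(torch.randn(64, 64, device="cuda"))
    g.replay()
    result = static_out.clone()
```

Notice what the replay path buys you: one launch for the whole step instead of one per kernel. And notice what it demands: stable shapes, stable addresses, stable control flow. That is the contract.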
A system can be slow even if every kernel is individually optimized, because the schedule is broken. Decode is where that becomes undeniable. You can have perfect kernels and still lose to launch overhead, synchronization structure, and an unstable execution path.
If you want one sentence to carry the section, it is this.
Prefill rewards fast kernels. Decode punishes a slow schedule. — Hazem Ali
Multi GPU introduces a new wall
On a single GPU, the wall is usually local: HBM bandwidth, cache locality, and whether your kernels keep data resident long enough to matter.
Once you go multi GPU, you add a new mandatory ingredient: coordination. The moment you shard weights or activations, each GPU stops being a complete computer and becomes a partial calculator. Partial results have to meet somewhere. That meeting is not free. Collectives, synchronization, and fabric scheduling become part of the critical path.
Now your ceiling can be the interconnect rather than HBM.
You can have perfect kernels and still degrade at scale if communication is not overlapped and staged like a pipeline. When the collective sits on the boundary of every step, the system starts behaving like stop and go traffic.
Compute, wait, communicate, wait, repeat. That is how you create a faster GPU and a slower system.
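The stop-and-go picture can be made concrete with a toy step-time model. The per-layer compute and all-reduce times below are assumptions for illustration; the point is the structure, not the numbers.

```python
# Toy step-time model: serialized vs overlapped compute and communication.
# Per-layer times are assumed values, not measurements.

def step_time(compute_ms, comm_ms, overlap: bool) -> float:
    if overlap:
        # Communication for layer i hidden behind compute for layer i+1:
        # the step is bound by whichever stream carries more work.
        return max(sum(compute_ms), sum(comm_ms))
    # Stop-and-go: compute, wait, communicate, wait, repeat.
    return sum(compute_ms) + sum(comm_ms)

compute = [1.0] * 32   # 32 layers, 1 ms of math each
comm    = [0.6] * 32   # 0.6 ms all-reduce per layer

print(f"{step_time(compute, comm, overlap=False):.1f} ms")  # 51.2 ms
print(f"{step_time(compute, comm, overlap=True):.1f} ms")   # 32.0 ms
```

Same kernels, same fabric, roughly 60% more wall time in the serialized case. The interconnect did not get slower. The schedule put it on the critical path.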
This is why some deployments feel like they hit a new wall after kernel optimization. You did. You moved the bottleneck.
The serving scheduler decides which kernels you even run
Production throughput is not just kernel speed. It is policy made physical.
The scheduler is not a side component that feeds the GPU. It is the thing that defines the workload. It decides which requests are grouped, which tokens advance, which sequences stall, and which memory layout you end up touching. That means it decides which kernels you run and how often you run them.
Batching is the obvious lever, but the deeper levers are the ones most teams never model. How you interleave prefill with decode changes the shape distribution of matmuls and attention, which changes kernel selection and whether you get a fast fused path or a slow generic path. Paging policy changes whether KV reads behave like a stream or a gather, which flips you between bandwidth bound and latency bound regimes. Allocation policy decides whether you reuse blocks or fragment memory, which decides whether a supposedly stable decode step stays stable across hours of traffic.
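The shape-distribution point is easy to see with a sketch. The hidden size and request lengths below are made-up illustration values; the GEMM dimensions follow the usual MLP-projection layout.

```python
# Sketch: batch composition decides the matmul shapes the GPU sees.
# HIDDEN and the token counts are illustrative assumptions.

HIDDEN = 4096

def gemm_shape(token_counts):
    """MLP-style GEMM over a batch: (total_tokens, HIDDEN) x (HIDDEN, 4*HIDDEN)."""
    m = sum(token_counts)
    return (m, HIDDEN, 4 * HIDDEN)  # (M, K, N)

# Decode-only step: 8 sequences, 1 new token each -> skinny M=8 GEMM.
print(gemm_shape([1] * 8))           # (8, 4096, 16384)

# Same step with one 512-token prefill mixed in -> fat M=520 GEMM.
print(gemm_shape([1] * 8 + [512]))   # (520, 4096, 16384)
```

Both steps run "the same model", but the kernel library sees two very different problems, and it will pick different code paths for them. The scheduler chose which one you got.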
Queueing is where it becomes brutal. The scheduler is also deciding when work is allowed to wait. That waiting changes cache residency, changes which sequences share pages, changes which blocks stay hot in L2, and changes whether collectives land on the critical path. By the time you observe a p99 spike, you are often seeing the aftershock of a scheduler decision made hundreds of microbatches earlier.
So yes, change the scheduler and you change the kernel mix. But the chain is tighter than that. Kernel mix changes memory pressure. Memory pressure changes cache hit rates and fragmentation. That changes whether your bottleneck is HBM bandwidth, metadata latency, or fabric synchronization. Tail latency is not a property of one kernel. It is a property of how the system composes kernels under load.
This is why two stacks can run the same model on the same GPU and still deliver wildly different p95 and p99. They are not running the same system. One stack is running a stable schedule with predictable locality and bounded variability. The other is running a moving target, and the GPU is just reacting to it.
Key takeaways
- Stop treating inference as one workload. Prefill and decode are different regimes with different ceilings. Prefill can look compute heavy under friendly shapes. Decode is where small work units, KV traffic, and scheduling overhead show up in p95 and p99.
- Use Roofline to bound reality before you tune anything. Performance is capped by the smaller of peak compute and bandwidth times operational intensity, meaning FLOPs per byte moved. If you cannot estimate bytes moved at the memory level you care about, you are not predicting performance. You are guessing.
- Name your memory wall before you try to fix it. Bandwidth saturation, latency from irregular access, and locality collapse are different failure modes. Bandwidth bound work wants fewer bytes and more reuse. Latency bound work wants more predictable access and better locality. Locality collapse wants a smaller hot set and a schedule that actually reuses data before it gets evicted.
- Kernel fusion is a bandwidth strategy, not a slogan. Fusing ops can cut HBM round trips by keeping intermediates in on chip storage, but it also increases register and shared memory pressure. Too much pressure reduces occupancy and can trigger spills, which quietly push you back to global memory. In other words, fusion is only a win when the resource balance still lets the SM hide latency.
- Attention performance is mostly about avoiding the wrong writes. FlashAttention style kernels reshape attention so they do not materialize the full N by N score and probability matrices in global memory. They rely on tiling and an online softmax style update to keep intermediate state on chip and only write the final output. That is why long context is feasible in practice.
- Paged KV is a serving win that introduces a kernel tax. PagedAttention partitions the KV cache into fixed size blocks stored non-contiguously, inspired by virtual memory paging, which reduces waste from variable sequence lengths and fragmentation. The trade is indirection. Indirection can turn streaming reads into scattered gathers and make you latency bound if locality is not preserved.
- Warp mechanics are not trivia, they are the execution model. On NVIDIA GPUs, threads are organized into warps of 32. A warp executes one common instruction at a time, and divergence serializes paths within that warp. If you do not reason at warp scope, you will misread why a kernel stalls.
- Pipelining is how you buy time, but depth is not free. Overlap works because you can stage the next tile while computing the current one using asynchronous global to shared copies and pipeline staging. But every extra stage consumes shared memory and registers, which can reduce occupancy and break latency hiding. Alignment rules are part of that contract.
- Decode can be slow with perfect kernels because launch and orchestration can dominate. CUDA Graphs exist to reduce CPU side launch overhead by capturing a sequence of GPU operations and replaying it as a single launchable unit. It helps when the execution path is stable. When shapes and control flow are unstable, capture and reuse become harder and you may need multiple graphs.
- Multi GPU moves the wall. Once you shard, collectives and synchronization become mandatory, and the interconnect can become the ceiling rather than HBM. If communication is not overlapped and staged, scaling degrades even if per GPU kernels are excellent.
- The serving scheduler is part of the kernel. Scheduler policy changes batch shape distribution, paging behavior, allocator behavior, and queueing. That changes the kernel mix, which changes memory pressure, which changes tail latency. Two stacks can run the same model on the same GPU and still produce different p95 and p99 because they are running different schedules.
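The Roofline takeaway above is just a `min` of two limits, and it is worth computing once by hand. The peak numbers below are illustrative assumptions in the rough range of a current datacenter GPU, not a spec sheet.

```python
# Roofline bound: attainable = min(peak_flops, bandwidth * operational_intensity).
# Hardware figures are illustrative assumptions, not vendor-guaranteed numbers.

PEAK_TFLOPS = 989.0   # assumed dense tensor-core peak, TFLOP/s
HBM_TBPS = 3.35       # assumed HBM bandwidth, TB/s

def attainable_tflops(flops_per_byte: float) -> float:
    """Best-case throughput for a kernel with the given operational intensity."""
    return min(PEAK_TFLOPS, HBM_TBPS * flops_per_byte)

# Decode-style GEMV, ~1 FLOP per byte moved: bandwidth bound.
print(attainable_tflops(1.0))    # 3.35 TFLOP/s, a fraction of a percent of peak
# Prefill-style GEMM, hundreds of FLOPs per byte: compute bound.
print(attainable_tflops(500.0))  # 989.0 TFLOP/s
```

Two kernels on the same chip, three orders of magnitude apart in attainable throughput, and neither number is a bug. That is the whole argument of this piece in one function.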
The philosophy of silicon
The future of AI is not only bigger models and bigger datasets.
It is tighter integration between the abstraction you write and the physics that executes it.
You can have a brilliant model design and still lose. Not because the model is wrong, but because the execution is starving. Kernels stall on memory. Intermediates bounce to HBM and back like it is free. Decode gets dominated by launch and synchronization tax. The scheduler turns VRAM into holes and the holes turn into lost concurrency. Nothing is broken, yet the system is slow.
That is the real shift. The unit of progress is no longer only the architecture on paper. It is the end to end path that moves bytes, stages tiles, reuses cache lines, and keeps the machine busy under real traffic.
So when someone asks why one stack feels fast and another feels fragile, the answer is rarely one trick. It is whether the system respects its own cost model. Whether it treats memory as a first class constraint. Whether it shapes workloads so locality survives. Whether it designs a schedule that stays stable when the load stops being polite.
We are no longer only writing code. We are deciding where data lives, when it moves, and what it costs each time it does.
We are choreographing bytes.
And under the hood, bytes are physics.
If you want to debate any section, or you have a production case that does not match this mental model, I am open for discussion on LinkedIn.