# The Silent Collapse: Deep-Stack Hardware–Software Failure Modes That Corrupt AI Systems Without a Trace > A distinguished-architect deep dive into the 12 most dangerous failure modes in AI infrastructure — from silent data corruption in GPU silicon to compiler cache poisoning, memory allocator drift, and kernel-launch corruption. Includes x86/PTX assembly analysis, Mermaid flow diagrams, a full comparative triage matrix, and a 12-month engineering roadmap with new observability primitives. - Author: Hazem Ali - Published: 2026-02-26 - Reading Time: 47 min read - Tags: AI Infrastructure, GPU, Silent Data Corruption, CUDA, Memory Architecture, Hardware Security, Compilers, Observability, Systems Architecture, Zero Trust - URL: https://drhazemali.com/blog/the-silent-collapse-deep-stack-hardware-software-failure-modes - Source: https://drhazemali.com --- # The Silent Collapse ## Deep-Stack Hardware–Software Failure Modes That Corrupt AI Systems Without a Trace There is a class of failure in AI systems that does not announce itself. No crash. No Xid. No stack trace. No alert. The system continues to serve. The metrics stay green. The model generates fluent, confident, plausible text. And the outputs are wrong. Not wrong in the way a hallucination is wrong — obviously, detectably, sometimes amusingly wrong. Wrong in the way a corrupted gradient is wrong: silently, systematically, in a direction you cannot distinguish from legitimate nondeterminism until you have already shipped it to production and the damage compounds across thousands of requests. I have spent over twenty years building and debugging complex systems including distributed infrastructure. I have traced failures from x86 microcode errata through kernel page-table corruption to CUDA driver bugs that only manifest under multi-process GPU sharing. 
My published work on Microsoft Tech Community — [The Hidden Memory Architecture of LLMs](https://techcommunity.microsoft.com/blog/educatordeveloperblog/the-hidden-memory-architecture-of-llms/4485367), [AI Didn't Break Your Production — Your Architecture Did](https://techcommunity.microsoft.com/blog/educatordeveloperblog/ai-didn%e2%80%99t-break-your-production-%e2%80%94-your-architecture-did/4482848), and my Zero Trust architectural guidance — keeps returning to the same thesis: **when AI fails in production, it is rarely because the model is weak. It is because the infrastructure contract was never specified, never verified, and never monitored.**

My research and analysis on deep systems architecture, AI & Deep Learning, GPU memory hierarchies, and AI infrastructure failure modes has been referenced by engineers and the broader systems community — not because these topics are novel in isolation, but because the complete failure surface across hardware, firmware, driver, runtime, compiler, and orchestration rarely gets assembled in one coherent framework. Most organizations only assemble it after an incident.

Now, before we go any further, I want to show you something.

```asm
HMMA.16816.F16 R4, R0, R2, R4   ; D[frag] = A[frag] * B[frag] + C[frag]

; Register R0 holds an A-matrix fragment (FP16)
; FP16 layout: 1 sign | 5 exponent | 10 mantissa
; Value = (-1)^s × 2^(e-15) × (1 + m/1024)
;
; R0 current value:
;   0 01100 0000000000 = +1.0 × 2^(-3) = 0.125
;
; After a single bit-flip on bit 14 (MSB of exponent):
;   0 11100 0000000000 = +1.0 × 2^(13) = 8192.0
;
; The matmul accumulates this into the output logits.
; An attention score that should have been near-zero now dominates softmax.
; The next token prediction shifts to an entirely different token.
; No error is raised. No exception. No log entry.
; The model confidently produces the wrong answer.
```

Now, some of you are probably looking at that and wondering what on earth it is. That is fair.
At my age, we spent a lot of time with code like this. It is called Assembly — specifically, this is NVIDIA Hopper SASS, the actual machine instructions that run on the tensor cores inside your GPU when your transformer layer does a matrix multiply.

And if you *can* read it, and you looked at it carefully, and you did not spot the problem — let me put this very clearly.

**The code is clean.** There is no bug. No typo. No off-by-one. No misaligned pointer. The instruction is correct. The register encoding is correct. The operation is correct. The issue is not one a code review could ever catch.

A single bit-flip — not in the code, not in the binary, but in the *physical register* holding the value at runtime — turned 0.125 into 8192.0. And from that point forward, every downstream computation is contaminated. The attention distribution shifts. The argmax changes. The model produces a different token. And nothing in your monitoring, your logging, your alerting, or your metrics will tell you it happened.

That is what Silent Data Corruption looks like from the inside. And that is what this article is about.
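Before we map the full failure surface, you can verify that register arithmetic yourself — no GPU required. A short Python sketch using the standard library's half-precision codec reproduces the exact bit-flip described in the SASS comments:

```python
import struct

def fp16_bits_to_float(bits: int) -> float:
    # Reinterpret a 16-bit pattern as an IEEE 754 binary16 value.
    return struct.unpack("<e", struct.pack("<H", bits))[0]

original = 0x3000               # 0 01100 0000000000 -> +1.0 x 2^(-3)
flipped = original ^ (1 << 14)  # flip bit 14, the MSB of the exponent field

print(fp16_bits_to_float(original))  # 0.125
print(fp16_bits_to_float(flipped))   # 8192.0
```

One flipped bit, sixteen doublings. That is the entire distance between a negligible attention score and one that dominates a softmax.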
--- **What This Article Covers:** - 12 failure modes that corrupt AI outputs without triggering any alert - Assembly-level analysis of how silent data corruption propagates through GPU pipelines - Hardware memory security implications from ECC bypass to row-hammer adjacency in HBM - Flow diagrams showing fault propagation from silicon through driver to model output - A comparative triage matrix with detectability, reproducibility, and severity ratings - New observability primitives: execution capsules, plan hashes, topology fingerprints - A 12-month engineering roadmap with owners, deliverables, and metrics The goal of this article is precise: **give platform teams a complete, prioritized catalogue of failure modes that can corrupt AI model outputs without raising alerts, along with concrete detection recipes and architectural mitigations.** This is not a debug guide. It is a governance framework for AI infrastructure correctness — built from the same Zero Trust principles I have advocated across my Microsoft publications: *never assume correctness; always verify; instrument every trust boundary.* --- ## Why AI systems fail differently Traditional distributed systems fail loudly. A database returns an error. A service returns a 500. A network partition triggers a timeout. The failure surface is well-understood, and decades of engineering have produced mature detection and recovery mechanisms. AI systems fail silently. And they fail silently because **correctness is an end-to-end property across hardware reliability, drivers, kernel libraries, allocators, compilers, and orchestration** — and no single layer owns it. Two trends amplify this: **Silent Data Corruption (SDC)** — hardware faults that evade detection yet alter numerical results. The Open Compute Project SDC whitepaper explicitly calls out the "needle in a haystack" detection challenge and the gap between low-level fault metrics and AI correctness metrics at scale. 
**Performance-driven dynamism** — autotuners, algorithm heuristics, caching allocators, and compilation caches that legitimately change execution plans over time. These are features, not bugs. But they create a system where the **same model, same input, same hardware** can produce different outputs depending on timing, memory pressure, concurrency, and cached state. ```mermaid graph TD A[AI Model Request] --> B{Hardware Layer} B --> C[GPU Silicon] B --> D[HBM Memory] B --> E[NVLink/PCIe] C -->|SDC: bit-flip in ALU| F[Corrupted Logits] D -->|ECC miss / row-hammer| F E -->|Link CRC miss| F F --> G{Driver Layer} G -->|Async error laundering| H[Wrong kernel attributed] G -->|Multi-process corruption| H H --> I{Runtime Layer} I -->|cuBLAS workspace drift| J[Different math path] I -->|Allocator fragmentation| J I -->|CUDA graph stale address| J J --> K{Compiler Layer} K -->|PTX JIT drift| L[Different codegen] K -->|Triton cache poison| L L --> M{Orchestration Layer} M -->|MIG undefined device| N[Wrong GPU class] M -->|NCCL topology change| N N --> O[Silent Output Corruption] style O fill:#dc2626,stroke:#991b1b,color:#fff style F fill:#f59e0b,stroke:#d97706,color:#000 style H fill:#f59e0b,stroke:#d97706,color:#000 style J fill:#f59e0b,stroke:#d97706,color:#000 style L fill:#f59e0b,stroke:#d97706,color:#000 style N fill:#f59e0b,stroke:#d97706,color:#000 ``` That diagram is the failure surface this article maps. Every arrow is a real failure mode I have either seen, debugged, or found documented in primary vendor sources. Twelve of them, across five layers. Let me walk you through each one. > We didn't change the model. > You are right. > **You changed the execution contract. And nobody wrote it down.** --- ## The Twelve Failure Modes ### 1. Silent Data Corruption That Evades Detection I opened with this one for a reason. It is the most dangerous because it is invisible. 
SDC occurs when hardware faults produce incorrect computational results *without triggering hardware error detection*. The Open Compute Project whitepaper highlights that AI workloads can mask these faults, making detection difficult at fleet scale, and emphasizes the mismatch between hardware metrics (FIT rates, Architectural Vulnerability Factor) and AI correctness metrics (accuracy, loss, token-level agreement). Most organizations treat "GPU error" as synonymous with explicit failures: Xid events, ECC errors, crashes. SDC breaks that assumption. It can *look like benign nondeterminism* unless you run canary invariants or cross-checks. Here is how it propagates through a transformer forward pass. The critical part: a single corrupted multiply-accumulate in the attention score computation can shift the entire probability distribution. ```c // Simplified attention score computation (one head, one query position) // In CUDA, this maps to tensor core HMMA instructions // // score[i] = sum_d(Q[q][d] * K[i][d]) / sqrt(d_k) // // If SDC corrupts one K[i][d] value during the dot product: // - The corrupted score[i] dominates after softmax // - The value vector V[i] gets disproportionate weight // - The output representation shifts // - Downstream layers amplify the error // - The argmax over the vocabulary changes // // At temperature=0, a SINGLE corrupted attention score // can deterministically produce the wrong next token. float corrupted_score = 0.0f; for (int d = 0; d < d_k; d++) { corrupted_score += Q[q][d] * K[i][d]; // <-- SDC injection point } corrupted_score /= sqrtf((float)d_k); // After softmax, this corrupted score dominates // The model confidently produces the wrong token ``` This is not hypothetical. At fleet scale — thousands of GPUs running 24/7 — the statistical expectation of silent bit-flips is non-zero. 
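To make "non-zero expectation" concrete, here is a back-of-envelope calculation. The fault rate below is an illustrative assumption for the sketch, not a vendor figure:

```python
# Fleet-scale SDC expectation (ASSUMED rate, purely illustrative)
gpus = 10_000
hours_per_year = 24 * 365            # 8,760 hours per device per year
sdc_per_gpu_hour = 1e-6              # assumed: ~1 silent fault per 114 GPU-years

expected_faults_per_year = gpus * hours_per_year * sdc_per_gpu_hour
print(expected_faults_per_year)      # ~87.6 silent faults per year, fleet-wide
```

Even if the true rate is an order of magnitude lower, a large fleet still expects several silent faults per year — which is why the detection posture has to assume corruption will happen, not that it might.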
Google's published research on silent data corruption documents that SDC occurs at measurable rates in production data centers, and that SDC events are not uniformly distributed. Some silicon lots and some operating conditions produce significantly higher rates. **How to detect it**: Run a deterministic "golden micro-batch." Fix seeds, disable algorithm benchmarking, enforce deterministic cuDNN algorithms. Store checksums of intermediate tensors and logits per golden run. Alert on deviations outside a tiny tolerance *when the execution capsule fingerprint matches*. Add "shadow execution" for 0.1–1% of traffic: rerun on a different GPU/node and compare logits distance. **Symptoms**: Temperature=0 inference occasionally yields different tokens on the same prompt. Training shows rare loss spikes that nobody can attribute to data or learning rate. Convergence to a different optimum across otherwise-identical runs. > **AI Silent Data Corruption at Scale** > Open Compute Project — *Open Compute Project Whitepaper* > > Highlights that AI can mask hardware faults, making SDC detection extraordinarily difficult at fleet scale, and documents the gap between hardware reliability metrics and AI correctness metrics. > > [Read more](https://www.opencompute.org/documents/ocp-wp-sdc-in-ai-20240814-pdf) --- ### 2. Memory-Error Recovery Side Effects This one is subtle. The failure itself gets handled. The *side effects* of the handling are what hurt you. On uncorrectable contained ECC errors, the NVIDIA driver terminates the affected application, then **dynamic page offlining** marks the faulty pages unusable. Later, **row remapping** can remap the faulty row in hardware after a GPU reset, potentially reclaiming those offlined pages. Teams look for "job failed." They rarely track the follow-on effects on memory shape. 
Page offlining and row-remap state can reduce available memory, change allocator behavior, or leave remediation pending until a reset — all of which feed into algorithm and plan selection drift.

Here is the chain that bites you: reduced workspace → cuDNN selects a different algorithm → that algorithm uses FP16 accumulation instead of FP32 → logits differ by enough to flip tokens on borderline cases. This is documented in cuDNN's notes on numerical accuracy varying by algorithm based on workspace availability.

```
GPU Memory Recovery State Machine:

┌──────────┐   ECC Error    ┌──────────────┐
│  Normal  │───────────────►│ Page Offline │
└──────────┘                └──────┬───────┘
                                   │ Row Remap Pending
                            ┌──────▼───────┐
                            │  GPU Reset   │
                            └──────┬───────┘
                                   │
                  ┌────────────────┼────────────┐
                  │                │            │
           ┌──────▼──────┐  ┌─────▼─────┐  ┌──▼──────────┐
           │  Remapped   │  │   Remap   │  │   Remap     │
           │  (Reclaim)  │  │  Pending  │  │   Failed    │
           └─────────────┘  └───────────┘  └─────────────┘
                                   │
                 Still degraded: reduced workspace,
                 different algo selection
```

**How to detect it**: Monitor `PAGE_RETIREMENT` and row-remap pending/failure counters via nvidia-smi or NVML. Correlate plan drift metrics (cuDNN/cuBLAS plan hashes) with page-retirement and row-remap counters. If you see plan changes on a GPU that recently had memory remediation, that is your signal.

---

### 3. Driver-Level Kernel-Launch Corruption Under Multi-Process Sharing

This is the one that makes me lose sleep.

NVIDIA Data Center GPU Driver release notes document a fixed issue: "potential corruption when launching kernels on H100 GPUs," more likely when the GPU is shared between multiple processes, manifesting as Xid 13 errors. When corruption is concurrency-sensitive, it gets misdiagnosed as model nondeterminism, user-kernel bugs, or framework issues.

In a multi-tenant GPU environment — MPS-enabled or time-sliced MIG — this means that **one tenant's workload can corrupt another tenant's results**. This is not a side-channel. It is a direct correctness violation.
From a Zero Trust perspective, this demands process-level isolation for correctness-critical workloads. ```c // What the driver bug looks like from userspace: // Process A launches GEMM kernel on stream 0 // Process B launches attention kernel on stream 1 // Driver internally shares launch state // Process A's kernel receives corrupted launch parameters // // Driver internal (simplified): // launch_state = acquire_shared_launch_slot(); // launch_state->grid_dim = user_grid; // Thread B races here // launch_state->block_dim = user_block; // Partial write + context switch // launch_state->params = user_params; // Now contains mixed state // dispatch_to_gpu(launch_state); // Corrupted launch // // Result: kernel executes with wrong grid dimensions or wrong parameters // Output tensor contains garbage in some regions, valid data in others // No error is raised — the kernel "completed successfully" ``` **How to detect it**: Log driver version + GPU model + multiplexing mode as part of every request capsule. Alert on output drift correlated with concurrency. Reproduce on a pinned driver: run two processes saturating GEMM and attention kernels concurrently, compare outputs to single-process baseline, and watch for Xids. > **NVIDIA Data Center GPU Driver Release Notes** > NVIDIA Corporation — *NVIDIA Documentation* > > Documents fixed issues including kernel-launch corruption under multi-process GPU sharing on H100 GPUs. > > [Read more](https://docs.nvidia.com/datacenter/tesla/gpu-driver-release-notes/) --- ### 4. Asynchronous Error Surfacing That Launders Root Cause CUDA kernel launches are asynchronous. Errors are reported at later synchronization points — `cudaMemcpy`, `cudaDeviceSynchronize`, sometimes just a random API call that happens to sync. Even benign calls may return error codes from *previous* asynchronous launches, per CUDA runtime API documentation. 
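The deferred-reporting semantics are easy to model in a few lines. This is a toy simulation — not the real CUDA runtime, whose sticky-error behavior is more involved — but it captures why the error lands on the wrong call:

```python
class ToyAsyncRuntime:
    """Toy simulation of deferred (asynchronous) error reporting."""

    def __init__(self):
        self._pending_error = None

    def launch(self, kernel: str, faulty: bool = False) -> str:
        # Launches return immediately; a fault is only recorded internally.
        if faulty and self._pending_error is None:
            self._pending_error = f"illegal memory access in {kernel}"
        return "success"  # the launch call itself never reports the fault

    def synchronize(self, caller: str) -> str:
        # The fault surfaces here, attributed to whoever happened to sync.
        if self._pending_error is not None:
            return f"{caller}: {self._pending_error}"
        return f"{caller}: success"

rt = ToyAsyncRuntime()
rt.launch("attention_kernel", faulty=True)  # fault occurs here
rt.launch("mlp_kernel")                     # still reports "success"
print(rt.synchronize("cudaMemcpy"))         # fault finally surfaces here
```

The triage trap is the last line: every log and stack trace points at the synchronizing call, not at the kernel that actually faulted.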
In inference pipelines with multiple CUDA streams, graph capture, and batched execution, the "sync gap" between a faulty kernel and the error report can span dozens of operations. If the pipeline consumes outputs before hitting a sync boundary, **corrupted logits are shipped to the client**. ```mermaid sequenceDiagram participant App as Application participant CUDA as CUDA Runtime participant GPU as GPU Hardware App->>CUDA: cudaLaunchKernel(kernelA) Note over CUDA: Returns immediately (async) CUDA->>GPU: Dispatch kernelA App->>CUDA: cudaLaunchKernel(kernelB) Note over CUDA: Returns immediately (async) CUDA->>GPU: Dispatch kernelB Note over GPU: kernelA hits illegal memory access GPU-->>CUDA: Error flagged in error buffer App->>CUDA: cudaMemcpy(result) Note over CUDA: Sync point — checks error buffer CUDA-->>App: Returns error from kernelA Note over App: Stack trace points to cudaMemcpy, not to kernelA where fault occurred Note over App: Worse case: if no sync before consuming output, corrupted data is used silently ``` The correct pattern: sync fences at correctness-critical boundaries. ```c // The dangerous pattern in AI serving: // Stream 0: attention kernel (faults silently) // Stream 1: MLP kernel (reads attention output — already corrupted) // Stream 1: softmax + argmax (produces valid-looking but wrong token) // Only AFTER the response is sent does a sync boundary catch the error // Correct pattern: cudaLaunchKernel(attention_kernel, ...); cudaStreamSynchronize(stream_0); // Fence: catch errors before consuming CHECK_CUDA_ERROR(); // Propagate immediately cudaLaunchKernel(mlp_kernel, ...); ``` --- ### 5. cuBLAS Multi-Stream Workspace Nondeterminism cuBLAS guarantees bitwise reproducibility under specific conditions, but explicitly warns that the guarantee **does not hold when multiple CUDA streams are active**. Nondeterminism arises from internal workspace selection optimizations. 
This matters because the transformer's entire computation is a chain of matrix multiplications. If different workspace selections produce different floating-point rounding in the attention computation, the softmax distribution shifts, and the argmax can change. "Same input, same GPU, different output" — and nobody changed anything.

> **Why workspace selection changes math**
>
> cuBLAS selects different internal scratch-memory regions under concurrent streams, leading to different floating-point accumulation orders — which changes results due to IEEE 754 rounding.
>
> cuBLAS internally partitions a workspace buffer for intermediate accumulation. Under multiple concurrent CUDA streams sharing a handle, the specific partition selected depends on timing. Different partitions can lead to different accumulation orders. Since FP addition is non-associative — `(a + b) + c ≠ a + (b + c)` in floating point — different orders produce different results at the ULP (Unit in the Last Place) level. Over a full transformer layer, these ULP differences can compound and flip an argmax.
>
> Mitigation: `CUBLAS_WORKSPACE_CONFIG=:4096:8` forces deterministic workspace selection. Cost: roughly 24 MiB of additional library footprint in GPU memory and a minor perf regression.

```python
# Demonstration: cuBLAS workspace nondeterminism
# (stream setup elided; the two GEMMs run on separate CUDA streams)

# WITHOUT workspace config: nondeterministic under multi-stream
result_a = torch.mm(Q, K.T)  # cuBLAS GEMM on stream 0
result_b = torch.mm(Q, K.T)  # cuBLAS GEMM on stream 1 (concurrent)
# result_a and result_b may differ at the ULP level

# WITH workspace config: deterministic workspace selection.
# NOTE: must be in place before the process initializes cuBLAS —
# set it in the launch environment, not mid-run:
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
# Now: same inputs → same workspace → same rounding → bitwise identical
```

---

### 6.
cuDNN Atomic-Based Nondeterminism and Cross-Architecture Drift cuDNN states most routines are bitwise reproducible on the same architecture, but lists exceptions that are nondeterministic because they use atomic operations introducing "truly random floating point rounding errors." Across architectures, cuDNN routines do **not** guarantee bitwise reproducibility. Let me show you why from first principles. This is the kind of thing that kept me up at night twenty years ago, and it is the same physics now, just at a different scale. ```asm ; GPU atomic add (simplified SASS — NVIDIA assembly) ; When multiple threads atomically add to the same address, ; the ORDER of additions is nondeterministic. ; ; FP addition is NOT associative: ; (a + b) + c ≠ a + (b + c) in floating point ; ; Example with FP16 values: ; Thread 0: atomicAdd(&sum, 0.0001) ; Thread 1: atomicAdd(&sum, 1000.0) ; Thread 2: atomicAdd(&sum, 0.0001) ; ; Order A: (0.0001 + 1000.0) + 0.0001 = 1000.0 (0.0001 lost to rounding) ; Order B: (0.0001 + 0.0001) + 1000.0 = 1000.0 (0.0002 preserved then lost) ; Order C: 0.0001 + (1000.0 + 0.0001) = 1000.0 (different rounding path) ; ; ATOM.E.ADD.F16 R4, [R2], R0 ; Atomic FP16 add to global memory ; The execution order depends on warp scheduling — inherently nondeterministic ``` This is a fundamental mathematical property, not a bug. Oak Ridge National Laboratory's SC24 work documents that deep learning sensitivity to floating-point non-associativity can be "extreme," impacting reproducibility and certification. The practical consequence: cross-region failover is brittle. Same request, different GPU generation, different completion. Teams set seeds and think they are done. They are not. They need to track which exact cuDNN algorithms were selected and understand that "deterministic selection" is distinct from "deterministic algorithm." 
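The underlying claim — that summation order changes floating-point results — is verifiable in three lines of host-side arithmetic, no GPU required:

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 1e20, -1e20

left = (a + b) + c   # 0.1 is absorbed into 1e20's rounding, then cancelled away
right = a + (b + c)  # the large terms cancel first, so 0.1 survives

print(left)   # 0.0
print(right)  # 0.1
```

An atomic GPU reduction effectively picks one of these groupings at random on every run, per warp schedule — which is exactly why cuDNN's atomic-based routines cannot promise bitwise reproducibility.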
> **The Impact of Floating-Point Non-Associativity on Reproducibility in HPC and Deep Learning** > Oak Ridge National Laboratory — *SC24 — Supercomputing Conference* > > Documents that deep learning's sensitivity to floating-point non-associativity can be extreme, fundamentally impacting reproducibility and result certification in production systems. > > [Read more](https://www.ornl.gov/publication/impact-floating-point-non-associativity) --- ### 7. Allocator Fragmentation Forcing Algorithm/Precision Drift This one is insidious because it correlates with uptime and load patterns, not code changes. Same model, same input, same GPU — different output after hours of production traffic because memory fragmentation changed the algorithm selection. Framework allocators fragment GPU memory into "slivers" as batch sizes fluctuate. PyTorch documents how this pattern can lead to unrecoverable fragmentation without mitigation (e.g., expandable segments). Algorithm selection in cuDNN depends on available workspace — and cuDNN explicitly notes that numerical accuracy varies by algorithm based on whether extra workspace enables FP32 accumulation vs FP16. ```mermaid flowchart LR A[Fresh GPU 140GB free] -->|Load model + warm up| B[120GB allocated 20GB contiguous free] B -->|Serve varied batch sizes| C[120GB allocated 20GB fragmented into slivers] C -->|cuDNN probes workspace| D{Largest contiguous block?} D -->|≥ 8MB| E[Algorithm A FP32 accumulation Higher accuracy] D -->|< 8MB| F[Algorithm B FP16 accumulation Lower accuracy] E --> G[Logits: 0.4312, 0.4310, 0.1378] F --> H[Logits: 0.4315, 0.4307, 0.1378] G --> I[argmax → token 0] H --> J[argmax → token 1] style F fill:#dc2626,stroke:#991b1b,color:#fff style J fill:#dc2626,stroke:#991b1b,color:#fff style H fill:#f59e0b,stroke:#d97706,color:#000 ``` **How to detect it**: Create memory pressure by allocating large tensors. Run a conv/attention op. Release pressure and repeat. Log selected algorithms and workspace sizes and compare. 
If they differ, your memory state is influencing your math.

---

### 8. PTX JIT and Compute-Cache Invalidation

CUDA fat binaries may include PTX — NVIDIA's intermediate representation. If the binary for the current GPU architecture is not present, the driver JIT-compiles PTX into SASS (the actual machine code). The driver caches the result. And here is the part everyone misses: **the compute cache is automatically invalidated when the driver is upgraded**.

Teams freeze container images but allow host driver updates, assuming "container immutability implies execution immutability." PTX JIT breaks that assumption completely. The container is immutable. The driver is not. The generated machine code changes.

```asm
// PTX (intermediate) — portable across GPU architectures.
// This gets JIT-compiled by the driver into actual SASS:
.visible .entry attention_kernel(
    .param .u64 param_Q,
    .param .u64 param_K,
    .param .u64 param_V
)
{
    .reg .b64   %rd<4>;
    .reg .f16x2 %h<8>;
    .reg .f32   %f<16>;

    ld.param.u64  %rd0, [param_Q];
    ld.global.b32 %h0, [%rd0];     // packed f16x2 load

    // The driver's JIT compiler decides:
    // - Instruction scheduling order
    // - Register allocation
    // - Memory access coalescing patterns
    // - Whether to use FMA or separate MUL+ADD
    //
    // Different driver versions → different SASS → different rounding
    mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
        {%f0, %f1, %f2, %f3},
        {%h0, %h1, %h2, %h3},
        {%h4, %h5},
        {%f0, %f1, %f2, %f3};
}
```

**Symptoms**: "same container, slightly different outputs" after a driver upgrade. Cold-start regressions when the JIT cache is cold.

**Mitigation**: ship SASS for target architectures to reduce PTX reliance. Pin driver versions for determinism tiers. Treat driver upgrades like model upgrades — canary with deterministic golden sets.

---

### 9. Triton Kernel Cache Key Gaps

Triton's cache key derives from installation hash, source hash, backend hash, options hash, and selected environment variables.
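Schematically, a hash-composed cache key looks like the sketch below. This is a hypothetical illustration, not Triton's actual implementation — the function name and fields are invented — but it shows why every factor that can change codegen must be part of the key:

```python
import hashlib
import json

def kernel_cache_key(source, backend, options, env_fingerprint=None):
    # Hypothetical sketch: hash every factor that can change generated code.
    payload = {
        "source": source,
        "backend": backend,
        "options": json.dumps(options, sort_keys=True),  # stable serialization
        "env": json.dumps(env_fingerprint or {}, sort_keys=True),
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

# Key omits the driver: two different environments collide on one cache entry.
stale = kernel_cache_key("kernel src", "cuda", {"num_warps": 4})
reuse = kernel_cache_key("kernel src", "cuda", {"num_warps": 4})
assert stale == reuse  # a stale binary would be served silently

# Key includes the driver: an upgrade forces recompilation.
k_old = kernel_cache_key("kernel src", "cuda", {"num_warps": 4}, {"driver": "550.54"})
k_new = kernel_cache_key("kernel src", "cuda", {"num_warps": 4}, {"driver": "550.90"})
assert k_old != k_new  # no silent reuse across the upgrade
```

Note the `sort_keys=True` on the options dict: a nondeterministic serialization of options is itself a cache-key bug, independent of which fields are included.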
A recent Intel XPU backend issue warns that cache keys may miss backend/target invalidation factors (driver, compiler, environment), leading to incorrect cache reuse, nondeterministic behavior, and subtle correctness bugs. It also flags nondeterministic `str(options)` serialization risks. From a supply-chain security standpoint, this is a real attack surface. Triton's kernel cache is a **pre-compiled binary artifact that gets loaded and executed on the GPU without re-verification**. If the cache is shared across nodes — a common optimization — a poisoned or stale cache entry can affect every node in the cluster. ```python # Triton cache structure (simplified): # $TRITON_CACHE_DIR/ # {installation_hash}/ # {source_hash}/ # {backend_hash}/ # {options_hash}/ # kernel.cubin ← Compiled GPU binary # kernel.json ← Metadata # # Attack scenario: # 1. Shared NFS cache across cluster # 2. Node A compiles kernel with driver v550.54 # 3. Node B gets driver v550.90 (security patch changes codegen) # 4. Node B loads cached cubin from Node A — WRONG BINARY # 5. No error. No warning. Different numerical behavior. ``` > **Supply Chain Risk** > > Shared Triton cache volumes across nodes with different driver versions, compiler versions, or hardware configurations can silently serve stale or incorrect compiled kernels. Treat this with the same rigor as any code-signing verification gap. --- ### 10. CUDA Graph Address-Capture Corruption CUDA graphs replay operations using the **exact memory addresses captured during recording**. If tensors are deallocated before replay, the graph accesses freed memory, causing corruption. Frameworks introduce graph-specific allocator strategies (private pools, checkpointing) that can still be misused. 
```c
// The dangerous pattern:
void serve_request() {
    // Record phase
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    float* temp = allocate_tensor(1024);  // Address: 0x7f3a00000000
    launch_attention_kernel(temp, ...);
    launch_mlp_kernel(temp, ...);
    cudaStreamEndCapture(stream, &graph);
    // (cudaGraphInstantiate(&graph_exec, graph, ...) elided)

    free_tensor(temp);  // temp freed! Address returned to pool
    // ... later, another allocation reuses 0x7f3a00000000 ...

    // Replay phase
    cudaGraphLaunch(graph_exec, stream);
    // Graph reads from 0x7f3a00000000 — now contains DIFFERENT data
    // Output is "valid" (no crash) but semantically wrong
    // Model produces fluent, confident, incorrect text
}
```

Arbitrary, non-local corruption. A stale capture can flip a single layer's output and yield plausible but incorrect completions. The model does not know it is wrong. The serving framework does not know it is wrong. The user does not know it is wrong.

---

### 11. MIG "Undefined Device" Placement Heterogeneity

In MIG "mixed strategy," NVIDIA's Kubernetes guidance warns that if a container requests more than one device type (e.g., `nvidia.com/gpu` plus a MIG resource), "the device received is undefined" in default setups. Different SM counts, different memory partitions → different kernel choices and different numeric behavior.

Teams assume resource requests deterministically map to hardware. The undefined mapping sits in orchestration docs, not ML postmortems. I have seen output drift that perfectly correlated with placement — not model version, not code changes, not data. Placement.

**How to detect it**: Deploy two pods with mixed resource requests. Log the actual device UUID and MIG profile. Compare output hashes for the same prompt. Add admission-control policies rejecting multi-type GPU requests unless explicitly allowed.

---

### 12. NCCL Topology/Algorithm Changes and FP Non-Associativity

NCCL selects algorithms — Ring, Tree, and others — based on topology and configuration. Floating-point addition is non-associative.
Reduction order changes alter outputs. NCCL has extensive environment variables controlling algorithm, protocol, and topology. Most teams never pin them. ```mermaid flowchart TD subgraph Ring["Ring AllReduce (8 GPUs)"] R0[GPU 0] --> R1[GPU 1] --> R2[GPU 2] --> R3[GPU 3] R3 --> R4[GPU 4] --> R5[GPU 5] --> R6[GPU 6] --> R7[GPU 7] R7 --> R0 end subgraph Tree["Tree AllReduce (8 GPUs)"] T0[GPU 0] --> T01[+] T1[GPU 1] --> T01 T2[GPU 2] --> T23[+] T3[GPU 3] --> T23 T01 --> T0123[+] T23 --> T0123 T4[GPU 4] --> T45[+] T5[GPU 5] --> T45 T6[GPU 6] --> T67[+] T7[GPU 7] --> T67 T45 --> T4567[+] T67 --> T4567 T0123 --> TRoot[Root +] T4567 --> TRoot end Ring -->|"FP non-associativity Different reduction order"| Result1["Result A: 1.0000001"] Tree -->|"FP non-associativity Different reduction order"| Result2["Result B: 1.0000003"] Result1 --> Diff["Different gradient updates → Different model weights → Different outputs"] Result2 --> Diff ``` Teams attribute training divergence to optimizer randomness rather than collective order and topology drift. They rarely consider that NIC selection and ring construction can change between runs, between restarts, sometimes between steps if the fabric is congested. **Mitigation**: Pin NCCL algorithm/protocol/topology. Record NCCL config and topology fingerprint per run. Add topology-aware determinism tiering — repro tiers that require stable rings/trees and deterministic reductions. --- ## The Deep Security Perspective I want to go deeper into the security surface, because the failure modes above are not just reliability concerns. They intersect with hardware security in ways that should concern anyone running AI systems that make consequential decisions. > **Row Hammer in HBM: disturbance errors in AI accelerator memory** > > High-density HBM stacks are susceptible to adjacent-row disturbance that can corrupt model weights in memory — bypassing all software-level integrity checks. 
> > Row hammer causes bit-flips in physically adjacent DRAM rows via repeated access. In HBM, the physics is the same: higher density means smaller cell spacing means potentially higher vulnerability. Model weights sit in HBM for the entire inference lifetime. A targeted disturbance pattern — from a co-located workload or from natural access patterns — could corrupt specific weight values. And SECDED ECC cannot catch every case: single-bit errors across multiple codewords are each individually "corrected" but the cumulative semantic effect on the model is uncorrected. Consider the x86 equivalent for context. The GPU uses different instructions but the principle is identical: ```asm ; x86 row-hammer pattern — repeated access to two addresses ; that map to the same bank but different rows hammer_loop: mov rax, [row_A] ; Access row A mov rbx, [row_B] ; Access row B (different row, same bank) clflush [row_A] ; Flush cache — force DRAM access clflush [row_B] ; Flush cache — force DRAM access mfence ; Ensure flushes complete dec rcx jnz hammer_loop ; Repeat millions of times ; After sufficient iterations, row C (between A and B) may have bit-flips ; In HBM, the TSV geometry and higher density may lower the threshold ``` This is why I advocate for **cryptographic weight verification at runtime** — not just at load time. Hash the weight tensors periodically and compare against known-good values. It is expensive. It is the only reliable defense against silent weight corruption, whether from hardware faults, row-hammer, or supply-chain tampering. ### The Zero Trust argument for GPU compute In my Zero Trust work, the core principle is: **never trust, always verify.** This applies to network boundaries, identity, and data. I argue it must also apply to compute: 1. **Never trust that the GPU computed correctly** — verify with canary workloads and cross-node comparison 2. **Never trust that the cached binary is valid** — verify with environment fingerprints and signatures 3. 
**Never trust that the memory is uncorrupted** — verify with periodic weight hashing 4. **Never trust that the driver is benign** — verify with deterministic golden-set regression after every upgrade 5. **Never trust that the execution plan is stable** — verify with plan-hash tracking and drift alerting This is not paranoia. This is the engineering discipline required when a silent one-bit error in a GPU register can change a medical diagnosis, a financial recommendation, or a safety-critical decision. > **Why ECC is necessary but not sufficient** > > SECDED ECC catches most single-bit errors, but multi-bit errors within a codeword, cumulative single-bit errors across codewords, and errors between scrub cycles all escape correction at the semantic level. > > ECC scrubbing is periodic, not continuous. Between scrubs, errors accumulate. Multi-bit errors within a codeword are detected but not corrected — the GPU raises an uncorrectable error. But single-bit errors across multiple codewords are each individually corrected by ECC while the cumulative effect on model weights changes model behavior. ECC says "memory is fine." The model says something different. --- ## Comparative Triage Matrix ![Figure: Silent Collapse Risk Map (Indexed). Detectability (x-axis) vs. impact severity (y-axis), with bubble size representing reproducibility difficulty. Each point is indexed (1–12) and mapped in the legend panel to a specific failure mode. 
The upper-left region (high impact, low detectability) captures the most dangerous class: failures that masquerade as benign nondeterminism while silently corrupting outputs and evading standard infrastructure alerts.](https://drhazemali.com/storage/silent-collapse-risk-map.jpg) **Deep-Stack Failure Mode Triage Matrix** | Failure Mode | Root Cause Layer | Detectability | Repro Difficulty | Impact Severity | Mitigation Cost | | --- | --- | :---: | :---: | :---: | :---: | | Silent Data Corruption (SDC) | Hardware / Firmware | Low | Hard | Critical | High | | Memory Recovery Side Effects | Firmware / Driver / Runtime | Medium | Medium | High | Medium | | Driver Kernel-Launch Corruption | Driver | Medium | Hard | Critical | Medium | | Async Error Laundering | Driver / Runtime | Medium | Medium | High | Medium | | cuBLAS Multi-Stream Workspace | Runtime / Kernel Library | Low | Medium | High | Medium | | cuDNN Atomic Nondeterminism | Kernel Library / Hardware | Low | Medium | High | Medium | | Allocator → Plan Drift | Runtime / Kernel Library | Low | Hard | High | Medium | | PTX JIT Cache Invalidation | Driver / Compiler / Runtime | Low | Medium | High | Medium | | Triton Cache-Key Gaps | Compiler / Runtime | Low | Hard | Critical | Medium–High | | CUDA Graph Address Reuse | Runtime / Model | Low | Medium | Critical | Medium | | MIG Undefined Device | Orchestration | Medium | Easy | High | Low–Medium | | NCCL Topology Variability | Orchestration / Kernel Library | Medium | Medium | High | Medium | *Priority ordering: SDC > Workspace/Allocator Drift > Compiler/Cache Drift. These three represent the highest impact with lowest detectability for most organizations.* ### The top three These three failure modes should be the immediate priority for any organization running AI inference at scale. **SDC** — silent wrong math at the silicon level. No error signal. No log entry. No alert. The model confidently produces incorrect output. 
**Workspace / Allocator-Driven Drift** — cuBLAS multi-stream workspace selection plus fragmentation-driven algorithm changes create systematic, placement-dependent semantic divergence with no faults. **Compiler / Cache Drift** — PTX JIT invalidation plus Triton cache-key gaps produce different kernels after fleet changes, and everyone misattributes it to "model randomness." If you only have budget for three workstreams, start there. --- ## Engineering Roadmap ### Month 0–2: Establish Correctness Observability Baselines **Owners**: GPU Platform/SRE, ML Serving Runtime, Observability **Month 0–2 Deliverables** | Deliverable | Owner | Success Metric | | --- | --- | --- | | GPU health ingestion (Xids, ECC, page retirement, row-remap, DCGM) | SRE | 100% of inference GPUs reporting health signals | | Determinism tier config per service (stream policy, workspace config, cuDNN flags) | ML Serving | All Tier-1 models have determinism spec | | GPU identity in request telemetry (UUID, MIG profile, driver version) | Observability | > 95% of requests have GPU identity attached | ### Month 2–4: Golden Canaries and Drift Attribution **Owners**: ML Quality, Serving Runtime, SRE Deploy a golden prompt suite at temperature=0 with stable output signatures — token IDs plus logits checksum. Run shadow execution for a small traffic slice. Compare outputs and quarantine nodes with repeated divergence. Target: drift rate per 1M requests baselined, then 10× reduction. ### Month 4–6: Deterministic Execution Capsules **Owners**: Serving Runtime, Observability The **execution capsule** is the core new primitive. It makes "what ran" reconstructible. Everything a post-mortem needs to explain why two identical requests produced different outputs, captured at request time, not after the incident. 
```json { "capsule_version": "v1", "request": { "request_id": "uuid", "seed": 0, "decoding": { "temperature": 0.0, "top_p": 1.0, "max_tokens": 256 } }, "model": { "model_id": "org/model@artifact", "weights_sha256": "hex", "tokenizer_sha256": "hex" }, "hardware": { "gpu_vendor": "nvidia", "gpu_uuid": "GPU-xxxx-xxxx", "arch": "hopper", "sm_count": 132, "mig_profile": null, "driver_version": "550.54.15", "hbm_ecc_state": "enabled", "row_remap_pending": false, "retired_pages": { "sbe": 0, "dbe": 0 } }, "runtime": { "framework": "pytorch", "cuda_toolkit": "12.4", "cudnn": "9.1.0", "cublas_workspace_config": ":4096:8", "allocator": { "backend": "native", "expandable_segments": true, "max_split_size_mb": 512 } }, "plans": { "cudnn_engine_plan_hash": "hex", "cublas_algo_fingerprint": "hex", "triton_cache_key": "base32", "triton_env_fingerprint": "hex", "cuda_graph_id": "hash" }, "distributed": { "nccl_algo": "Ring", "nccl_protocol": "Simple", "topology_fingerprint": "hex" }, "outputs": { "token_ids_sha256": "hex", "logits_checksum": "hex" } } ``` Target: greater than 90% of drift incidents have complete capsule reconstruction. ### Month 6–12: Governance Gates for Fleet Changes **Owners**: Platform Engineering, Security/Supply Chain, Release Engineering **Governance Gate Requirements** | Gate | Trigger | Required Action | Blocking | | --- | --- | --- | :---: | | Driver Upgrade | New driver version on any node | Canary on golden suite; compare capsule diffs | Yes | | Triton Cache Shared | Cache volume mounted cross-node | Verify env fingerprint match; sign artifacts | Yes | | MIG Mixed Request | Pod requests multiple GPU types | Reject unless explicitly allowed by policy | Yes | | NCCL Config Change | Topology or algorithm env var change | Run all-reduce microbench; compare to baseline | Yes | | Container Image Update | New image with different CUDA/cuDNN | Full golden-suite regression | Yes | Target: zero post-upgrade drift incidents for deterministic-tier models. 
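The driver-upgrade gate above requires "canary on golden suite; compare capsule diffs." As a minimal sketch of that comparison step, here is a capsule-diff check. Field names follow the capsule schema shown earlier; the function name, the choice of which sections block a rollout, and the example values are illustrative assumptions, not a prescribed implementation:

```python
# Capsule sections whose drift should block a fleet change:
# "plans" (execution-plan hashes) and "outputs" (token/logits signatures).
DRIFT_SECTIONS = ("plans", "outputs")

def capsule_diff(baseline: dict, candidate: dict) -> list[str]:
    """Return the capsule fields that differ between two canary runs.

    Any difference in an execution-plan hash or an output signature means
    the canary is not bit-identical after the change, so the governance
    gate should block the rollout and attribute the drift to a layer.
    """
    drifted = []
    for section in DRIFT_SECTIONS:
        base = baseline.get(section, {})
        cand = candidate.get(section, {})
        for key in sorted(set(base) | set(cand)):
            if base.get(key) != cand.get(key):
                drifted.append(f"{section}.{key}")
    return drifted

# Hypothetical capsules from before and after a driver upgrade.
baseline = {
    "plans": {"cudnn_engine_plan_hash": "a1", "triton_cache_key": "k1"},
    "outputs": {"token_ids_sha256": "t1", "logits_checksum": "c1"},
}
candidate = {
    "plans": {"cudnn_engine_plan_hash": "a1", "triton_cache_key": "k2"},
    "outputs": {"token_ids_sha256": "t1", "logits_checksum": "c2"},
}

# The Triton cache key and the logits checksum changed after the upgrade:
# the gate blocks, and the drift is attributed to the compiler layer.
print(capsule_diff(baseline, candidate))
```

The value of the diff being field-granular is attribution: a changed `plans.triton_cache_key` with unchanged outputs is a recompile that happened to be benign, while a changed `outputs.logits_checksum` with unchanged plan hashes points below the compiler, toward the driver or the silicon.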
--- ## Observability Primitives ### Integration with OpenTelemetry GenAI OpenTelemetry GenAI semantic conventions define spans and attributes for inference requests. The right approach is to extend this with **capsule references** rather than stuffing all hardware details into span attributes. Keep the spans lightweight. Let the capsule store carry the forensic depth. Proposed signals: - `gen_ai.execution.capsule_ref` → content-addressed pointer (hash/URI) to capsule - `gen_ai.execution.plan_hashes` → small set of hashes (cuDNN/cuBLAS/Triton/NCCL) - `system.gpu.uuid`, `system.gpu.mig_profile`, `system.gpu.driver_version` → low-cardinality routing and debug fields ```mermaid sequenceDiagram participant U as Client participant S as Inference Service participant R as Runtime (PyTorch + CUDA) participant D as Driver / Firmware participant M as GPU Monitor (DCGM) participant O as OTel Collector participant C as Capsule Store participant A as Drift Alerter U->>S: request(prompt, seed=0, temp=0) S->>R: execute(model, input) R->>D: launch kernels (streams/graphs) D-->>R: completion + deferred errors M-->>S: health signals (Xid, ECC, retirement) S->>C: write ExecutionCapsule S->>O: emit GenAI span + capsule_ref S-->>U: response(tokens) + X-Capsule-Ref header C->>A: stream capsule diffs A-->>S: alert on plan-hash drift A-->>M: correlate drift with GPU RAS events ``` ### Supply Chain: SBOM and AIBOM The same capsule principle extends to AI supply chain governance. Connect runtime dependencies (SBOM) and AI artifacts (AIBOM) to execution capsules. This creates a complete chain: model provenance → runtime environment → execution plan → output verification. Relevant standards: NTIA SBOM minimum elements for baseline disclosure. SPDX 3.0.1 with AI and Dataset profiles. AIBOM as a first-class supply-chain element via SPDX extension. IETF draft EAT profile for AI agents referencing SBOM/AIBOM via attestation claims. 
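For `gen_ai.execution.capsule_ref` to be a stable content-addressed pointer, the capsule has to be serialized canonically before hashing; otherwise two collectors can derive different references for the same capsule. A minimal sketch, assuming a `sha256:` prefix scheme for the reference (the function name and URI scheme are illustrative, not part of the OTel GenAI conventions):

```python
import hashlib
import json

def capsule_ref(capsule: dict) -> str:
    """Content-address a capsule: canonical JSON -> SHA-256 -> reference.

    Canonicalization (sorted keys, fixed separators) guarantees that two
    semantically identical capsules hash to the same reference, so the
    span attribute stays stable regardless of dict insertion order or
    which service serialized the capsule.
    """
    canonical = json.dumps(capsule, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"

capsule = {"capsule_version": "v1", "request": {"request_id": "uuid", "seed": 0}}
reordered = {"request": {"seed": 0, "request_id": "uuid"}, "capsule_version": "v1"}

# Same content, different key order -> identical reference.
assert capsule_ref(capsule) == capsule_ref(reordered)
print(capsule_ref(capsule))
```

The span then carries only this short reference plus the small plan-hash set, while the full forensic capsule lives in the capsule store — keeping spans low-cardinality without losing reconstructability.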
---

## Cross-Vendor Determinism

One more failure surface catches teams when they try to run the same model across AMD and NVIDIA GPUs. AMD's hipBLAS documents that some functions may use atomic operations to increase performance, so results are not guaranteed to be bit-wise reproducible. The backend defaults differ: rocBLAS may allow atomics by default, while cuBLAS disallows them. This means **the same PyTorch code running on AMD vs NVIDIA GPUs may have different determinism defaults**.

**Cross-Vendor Determinism Comparison**

| Property | NVIDIA (cuBLAS/cuDNN) | AMD (rocBLAS/MIOpen) |
| --- | --- | --- |
| Atomics default | Disabled (deterministic) | Enabled (nondeterministic) |
| Same-arch bitwise repro | Guaranteed (with conditions) | Guaranteed (with conditions) |
| Cross-arch bitwise repro | NOT guaranteed | NOT guaranteed |
| Determinism env var | CUBLAS_WORKSPACE_CONFIG | ROCBLAS_LAYER (partial) |
| Multi-stream guarantee | NOT guaranteed | NOT guaranteed |

If you are running a multi-vendor fleet, you need vendor-aware determinism policies. The execution capsule should capture which vendor and backend are active, and your drift alerting should account for the different defaults.

---

## Closing

I started this article with a piece of assembly code and told you the code was clean. The issue was never the code. It was the substrate. The silicon. The physics. The contracts between layers that nobody wrote down and nobody monitors.

That is the real lesson here. AI systems are the most complex software-hardware integration problem we have built at scale, and we are running them on infrastructure assumptions inherited from an era when "the hardware works" was a reasonable default. It is not anymore.

If you take one thing from this article, let it be this: **correctness in AI is not a model property. It is an infrastructure property. And infrastructure correctness only exists if you build it, verify it, and monitor it across every layer of the stack.**

That starts with the assembly.
It ends with the governance gates. And everything in between is where the silent collapses happen. --- ## Primary References > **The Hidden Memory Architecture of LLMs** > Hazem Ali — *Microsoft Tech Community* > > Deep dive into GPU memory hierarchies, HBM access patterns, and why memory architecture — not FLOPS — dominates LLM performance. Foundational context for understanding how memory-level faults propagate to model outputs. > > [Read more](https://techcommunity.microsoft.com/blog/educatordeveloperblog/the-hidden-memory-architecture-of-llms/4485367) > **AI Didn't Break Your Production — Your Architecture Did** > Hazem Ali — *Microsoft Tech Community* > > Establishes that production AI failures trace to architectural gaps, not model weakness — the same thesis that motivates infrastructure-level correctness verification in this article. > > [Read more](https://techcommunity.microsoft.com/blog/educatordeveloperblog/ai-didn%e2%80%99t-break-your-production-%e2%80%94-your-architecture-did/4482848) > **OCP Silent Data Corruption in AI Workloads** > Open Compute Project — *OCP Whitepaper* > > Industry whitepaper documenting SDC rates, detection challenges, and the gap between hardware reliability metrics and AI correctness metrics at fleet scale. > > [Read more](https://www.opencompute.org/documents/ocp-wp-sdc-in-ai-20240814-pdf) > **cuBLAS Reproducibility and Determinism** > NVIDIA Corporation — *NVIDIA CUDA Documentation* > > Official documentation of cuBLAS determinism guarantees, multi-stream workspace nondeterminism, and CUBLAS_WORKSPACE_CONFIG mitigation. > > [Read more](https://docs.nvidia.com/cuda/cublas/index.html#reproducibility) --- **Peer-Reviewed By:** - [**Jamel Abed**](https://mvp.microsoft.com/en-US/MVP/profile/60bc6923-7983-400d-9355-39dcd4cf247c) — Microsoft MVP, Product Evangelist *This article is part of a series on deep systems architecture for AI. 
Related reading:* - *[Kernel Dynamics: The Real Bottleneck of AI](/blog/kernel-dynamics-the-real-bottleneck-of-ai) — prefill vs decode, memory walls, and GPU pipeline design* - *[When Your LLM Trips the MMU](/blog/when-your-llm-trips-the-mmu) — page faults, TLB shootdowns, and the virtual memory tax of AI inference* - *[AI as a Worker, Not an Engineer](/blog/ai-as-worker-not-engineer) — the hidden ceilings of AI coding agents* - *[QSAF: Qorvex Security AI Framework](/blog/qsaf-qorvex-security-ai-framework) — 63 controls across 9 domains for AI security*