The Silent Collapse
Deep-Stack Hardware–Software Failure Modes That Corrupt AI Systems Without a Trace
There is a class of failure in AI systems that does not announce itself.
No crash. No Xid. No stack trace. No alert. The system continues to serve. The metrics stay green. The model generates fluent, confident, plausible text. And the outputs are wrong.
Not wrong in the way a hallucination is wrong — obviously, detectably, sometimes amusingly wrong. Wrong in the way a corrupted gradient is wrong: silently, systematically, in a direction you cannot distinguish from legitimate nondeterminism until you have already shipped it to production and the damage compounds across thousands of requests.
I have spent over twenty years building and debugging complex systems including distributed infrastructure. I have traced failures from x86 microcode errata through kernel page-table corruption to CUDA driver bugs that only manifest under multi-process GPU sharing. My published work on Microsoft Tech Community — The Hidden Memory Architecture of LLMs, AI Didn't Break Your Production — Your Architecture Did, and my Zero Trust architectural guidance — keeps returning to the same thesis: when AI fails in production, it is rarely because the model is weak. It is because the infrastructure contract was never specified, never verified, and never monitored.
My research and analysis on deep systems architecture, AI and deep learning, GPU memory hierarchies, and AI infrastructure failure modes have been referenced by engineers and the broader systems community. Not because these topics are novel in isolation, but because the complete failure surface across hardware, firmware, driver, runtime, compiler, and orchestration rarely gets assembled in one coherent framework. Most organizations only assemble it after the incident.
Now. Before we go any further, I want to show you something.
Now, some of you are probably looking at that and wondering what on earth it is. That is fair. Engineers of my generation spent a lot of time with code like this. It is called assembly. Specifically, this is NVIDIA Hopper SASS, the actual machine instructions your GPU executes when a transformer layer performs a matrix multiply.
And if you can read it, and you looked at it carefully, and you did not spot the problem — let me put this very clearly.
The code is clean.
There is no bug. No typo. No off-by-one. No misaligned pointer. The instruction is correct. The register encoding is correct. The operation is correct.
The issue lives below anything the source code can express.
A single bit-flip — not in the code, not in the binary, but in the physical register holding the value at runtime — turned 0.125 into 8192.0. And from that point forward, every downstream computation is contaminated. The attention distribution shifts. The argmax changes. The model produces a different token. And nothing in your monitoring, your logging, your alerting, or your metrics will tell you it happened.
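You can reproduce the arithmetic of that flip on a CPU. In IEEE half precision, 0.125 and 8192.0 differ by exactly one exponent bit, which is the reading this sketch assumes. It uses Python's struct half-float codec, purely for illustration, with no GPU involved:

```python
import struct

def f16_to_bits(x: float) -> int:
    """Bit pattern of x as an IEEE 754 half-precision value."""
    return struct.unpack('<H', struct.pack('<e', x))[0]

def bits_to_f16(b: int) -> float:
    """Half-precision value for a 16-bit pattern."""
    return struct.unpack('<e', struct.pack('<H', b))[0]

original = 0.125                                              # 0x3000 in FP16
corrupted = bits_to_f16(f16_to_bits(original) ^ (1 << 14))    # flip one exponent bit
print(f"{original} -> {corrupted}")                           # 0.125 -> 8192.0
```

One bit, five orders of magnitude. No exception, no error code, just a different number.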
That is what Silent Data Corruption looks like from the inside. And that is what this article is about.
The goal of this article is precise: give platform teams a complete, prioritized catalogue of failure modes that can corrupt AI model outputs without raising alerts, along with concrete detection recipes and architectural mitigations.
This is not a debug guide. It is a governance framework for AI infrastructure correctness — built from the same Zero Trust principles I have advocated across my Microsoft publications: never assume correctness; always verify; instrument every trust boundary.
Why AI systems fail differently
Traditional distributed systems fail loudly. A database returns an error. A service returns a 500. A network partition triggers a timeout. The failure surface is well-understood, and decades of engineering have produced mature detection and recovery mechanisms.
AI systems fail silently. And they fail silently because correctness is an end-to-end property across hardware reliability, drivers, kernel libraries, allocators, compilers, and orchestration — and no single layer owns it.
Two trends amplify this:
Silent Data Corruption (SDC) — hardware faults that evade detection yet alter numerical results. The Open Compute Project SDC whitepaper explicitly calls out the "needle in a haystack" detection challenge and the gap between low-level fault metrics and AI correctness metrics at scale.
Performance-driven dynamism — autotuners, algorithm heuristics, caching allocators, and compilation caches that legitimately change execution plans over time. These are features, not bugs. But they create a system where the same model, same input, same hardware can produce different outputs depending on timing, memory pressure, concurrency, and cached state.
That diagram is the failure surface this article maps. Every arrow is a real failure mode I have either seen, debugged, or found documented in primary vendor sources. Twelve of them, across five layers. Let me walk you through each one.
"But we didn't change the model." You are right. You changed the execution contract, and nobody wrote it down.
The Twelve Failure Modes
1. Silent Data Corruption That Evades Detection
I opened with this one for a reason. It is the most dangerous because it is invisible.
SDC occurs when hardware faults produce incorrect computational results without triggering hardware error detection. The Open Compute Project whitepaper highlights that AI workloads can mask these faults, making detection difficult at fleet scale, and emphasizes the mismatch between hardware metrics (FIT rates, Architectural Vulnerability Factor) and AI correctness metrics (accuracy, loss, token-level agreement).
Most organizations treat "GPU error" as synonymous with explicit failures: Xid events, ECC errors, crashes. SDC breaks that assumption. It can look like benign nondeterminism unless you run canary invariants or cross-checks.
Here is how it propagates through a transformer forward pass. The critical part: a single corrupted multiply-accumulate in the attention score computation can shift the entire probability distribution.
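That shift is easy to demonstrate in miniature. A toy sketch with stand-in score values, not a real attention kernel: corrupt a single score and the softmax argmax moves, so the model emits a different token with no error raised anywhere.

```python
import math

def softmax(xs):
    m = max(xs)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

clean_scores = [0.125, 0.110, 0.100]       # toy attention scores
corrupt_scores = [0.125, 0.110, 8192.0]    # one multiply-accumulate corrupted

print(argmax(softmax(clean_scores)))       # 0, the intended token
print(argmax(softmax(corrupt_scores)))     # 2, a different token, silently
```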
This is not hypothetical. At fleet scale — thousands of GPUs running 24/7 — the statistical expectation of silent bit-flips is non-zero. Google's published research on silent data corruption documents that SDC occurs at measurable rates in production data centers, and that SDC events are not uniformly distributed. Some silicon lots and some operating conditions produce significantly higher rates.
How to detect it: Run a deterministic "golden micro-batch." Fix seeds, disable algorithm benchmarking, enforce deterministic cuDNN algorithms. Store checksums of intermediate tensors and logits per golden run. Alert on deviations outside a tiny tolerance when the execution capsule fingerprint matches. Add "shadow execution" for 0.1–1% of traffic: rerun on a different GPU/node and compare logits distance.
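For the strict bitwise-reproducibility tier, the comparison step can be as simple as hashing serialized logits from the golden run. A minimal sketch, hashing raw float bytes; in practice you would serialize the actual tensors and fall back to a tolerance comparison for non-bitwise tiers:

```python
import hashlib
import struct

def logits_checksum(logits):
    """Deterministic digest of a logits vector (bitwise-repro tier)."""
    payload = struct.pack(f'<{len(logits)}f', *logits)
    return hashlib.sha256(payload).hexdigest()

golden = logits_checksum([2.31, -0.07, 5.88, 1.02])   # recorded at golden-run time
replay = logits_checksum([2.31, -0.07, 5.88, 1.02])   # healthy re-execution
corrupt = logits_checksum([2.31, -0.07, 8192.0, 1.02])  # one corrupted value

assert golden == replay    # same bits, same digest
assert golden != corrupt   # any flipped bit changes the digest
```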
Symptoms: Temperature=0 inference occasionally yields different tokens on the same prompt. Training shows rare loss spikes that nobody can attribute to data or learning rate. Convergence to a different optimum across otherwise-identical runs.
2. Memory-Error Recovery Side Effects
This one is subtle. The failure itself gets handled. The side effects of the handling are what hurt you.
On uncorrectable contained ECC errors, the NVIDIA driver terminates the affected application, then dynamic page offlining marks the faulty pages unusable. Later, row remapping can remap the faulty row in hardware after a GPU reset, potentially reclaiming those offlined pages.
Teams look for "job failed." They rarely track the follow-on effects on memory shape. Page offlining and row-remap state can alter available memory, allocator behavior, or cause pending remediation that requires a reset — feeding into algorithm and plan selection drift.
Here is the chain that bites you: reduced workspace → cuDNN selects a different algorithm → that algorithm uses FP16 accumulation instead of FP32 → logits differ by enough to flip tokens on borderline cases. This is documented in cuDNN's notes on numerical accuracy varying by algorithm based on workspace availability.
How to detect it: Monitor PAGE_RETIREMENT and row-remap pending/failure via nvidia-smi or NVML. Correlate plan drift metrics (cuDNN/cuBLAS plan hashes) with page-retirement and row-remap counters. If you see plan changes on a GPU that recently had memory remediation, that is your signal.
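The correlation itself is mechanical once both signals land in telemetry. A hypothetical sketch; the sample shape here, (retired_pages, plan_hash) per GPU over time, is an assumption for illustration, not an NVML or DCGM API:

```python
def flag_remediation_drift(samples):
    """samples: gpu_uuid -> time-ordered [(retired_pages, plan_hash), ...].
    Flag GPUs whose execution plan changed right after memory remediation."""
    flagged = []
    for gpu, series in samples.items():
        for (pages0, plan0), (pages1, plan1) in zip(series, series[1:]):
            if pages1 > pages0 and plan1 != plan0:
                flagged.append(gpu)
                break
    return flagged

fleet = {
    "GPU-aaaa": [(0, "h1"), (0, "h1"), (0, "h1")],   # healthy, stable plan
    "GPU-bbbb": [(0, "h1"), (2, "h9"), (2, "h9")],   # retirement followed by plan drift
}
print(flag_remediation_drift(fleet))  # ['GPU-bbbb']
```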
3. Driver-Level Kernel-Launch Corruption Under Multi-Process Sharing
This is the one that makes me lose sleep.
NVIDIA Data Center GPU Driver release notes document a fixed issue: "potential corruption when launching kernels on H100 GPUs," more likely when the GPU is shared between multiple processes, manifesting as Xid 13 errors. When corruption is concurrency-sensitive, it gets misdiagnosed as model nondeterminism, user-kernel bugs, or framework issues.
In a multi-tenant GPU environment — MPS-enabled or time-sliced MIG — this means that one tenant's workload can corrupt another tenant's results. This is not a side-channel. It is a direct correctness violation. From a Zero Trust perspective, this demands process-level isolation for correctness-critical workloads.
How to detect it: Log driver version + GPU model + multiplexing mode as part of every request capsule. Alert on output drift correlated with concurrency. Reproduce on a pinned driver: run two processes saturating GEMM and attention kernels concurrently, compare outputs to single-process baseline, and watch for Xids.
4. Asynchronous Error Surfacing That Launders Root Cause
CUDA kernel launches are asynchronous. Errors are reported at later synchronization points — cudaMemcpy, cudaDeviceSynchronize, sometimes just a random API call that happens to sync. Even benign calls may return error codes from previous asynchronous launches, per CUDA runtime API documentation.
In inference pipelines with multiple CUDA streams, graph capture, and batched execution, the "sync gap" between a faulty kernel and the error report can span dozens of operations. If the pipeline consumes outputs before hitting a sync boundary, corrupted logits are shipped to the client.
The correct pattern: sync fences at correctness-critical boundaries.
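Stripped of CUDA specifics, the pattern is: collect deferred errors from asynchronous stages and refuse to release outputs until they have been checked. An illustrative sketch; the class and method names are hypothetical, not a CUDA API:

```python
class SyncFence:
    """Surface deferred errors before outputs cross a trust boundary."""
    def __init__(self):
        self._deferred = []

    def record(self, stage, error):
        # async stages report failures here instead of raising inline
        self._deferred.append((stage, error))

    def check(self):
        # called at the correctness-critical boundary, before shipping outputs
        if self._deferred:
            raise RuntimeError(f"deferred errors at fence: {self._deferred}")

fence = SyncFence()
fence.record("attention_kernel", ValueError("illegal address"))
try:
    fence.check()              # output is blocked, not silently shipped
except RuntimeError:
    print("output blocked at fence")
```

The CUDA equivalent is an explicit synchronize plus error check after the last kernel whose output you are about to ship, rather than hoping a later API call happens to surface the failure.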
5. cuBLAS Multi-Stream Workspace Nondeterminism
cuBLAS guarantees bitwise reproducibility under specific conditions, but explicitly warns that the guarantee does not hold when multiple CUDA streams are active. Nondeterminism arises from internal workspace selection optimizations.
This matters because the transformer's entire computation is a chain of matrix multiplications. If different workspace selections produce different floating-point rounding in the attention computation, the softmax distribution shifts, and the argmax can change. "Same input, same GPU, different output" — and nobody changed anything.
6. cuDNN Atomic-Based Nondeterminism and Cross-Architecture Drift
cuDNN states most routines are bitwise reproducible on the same architecture, but lists exceptions that are nondeterministic because they use atomic operations introducing "truly random floating point rounding errors." Across architectures, cuDNN routines do not guarantee bitwise reproducibility.
Let me show you why from first principles. This is the kind of thing that kept me up at night twenty years ago, and it is the same physics now, just at a different scale.
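Here is the property in its smallest form: floating-point addition is not associative, on any hardware, in a few lines of Python.

```python
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # cancellation happens first, then add 1.0 -> 1.0
right = a + (b + c)   # 1.0 is absorbed into -1e16 first          -> 0.0

print(left, right)    # 1.0 0.0
assert left != right  # same operands, different grouping, different result
```

Atomics decide grouping by whichever thread wins the race, so every run can take a different path through exactly this kind of rounding.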
This is a fundamental mathematical property, not a bug. Oak Ridge National Laboratory's SC24 work documents that deep learning sensitivity to floating-point non-associativity can be "extreme," impacting reproducibility and certification.
The practical consequence: cross-region failover is brittle. Same request, different GPU generation, different completion. Teams set seeds and think they are done. They are not. They need to track which exact cuDNN algorithms were selected and understand that "deterministic selection" is distinct from "deterministic algorithm."
7. Allocator Fragmentation Forcing Algorithm/Precision Drift
This one is insidious because it correlates with uptime and load patterns, not code changes. Same model, same input, same GPU — different output after hours of production traffic because memory fragmentation changed the algorithm selection.
Framework allocators fragment GPU memory into "slivers" as batch sizes fluctuate. PyTorch documents how this pattern can lead to unrecoverable fragmentation without mitigation (e.g., expandable segments). Algorithm selection in cuDNN depends on available workspace — and cuDNN explicitly notes that numerical accuracy varies by algorithm based on whether extra workspace enables FP32 accumulation vs FP16.
How to detect it: Create memory pressure by allocating large tensors. Run a conv/attention op. Release pressure and repeat. Log selected algorithms and workspace sizes and compare. If they differ, your memory state is influencing your math.
8. PTX JIT and Compute-Cache Invalidation
CUDA fat binaries may include PTX — NVIDIA's intermediate representation. If the binary for the current GPU architecture is not present, the driver JIT-compiles PTX into SASS (the actual machine code). The driver caches the result. And here is the part everyone misses: the compute cache is automatically invalidated when the driver is upgraded.
Teams freeze container images but allow host driver updates, assuming "container immutability implies execution immutability." PTX JIT breaks that assumption completely. The container is immutable. The driver is not. The generated machine code changes.
Symptoms: "same container, slightly different outputs" after a driver upgrade. Cold-start regressions when JIT cache is cold. Mitigation: ship SASS for target architectures to reduce PTX reliance. Pin driver versions for determinism tiers. Treat driver upgrades like model upgrades — canary with deterministic golden sets.
9. Triton Kernel Cache Key Gaps
Triton's cache key derives from installation hash, source hash, backend hash, options hash, and selected environment variables. A recent Intel XPU backend issue warns that cache keys may miss backend/target invalidation factors (driver, compiler, environment), leading to incorrect cache reuse, nondeterministic behavior, and subtle correctness bugs. It also flags nondeterministic str(options) serialization risks.
From a supply-chain security standpoint, this is a real attack surface. Triton's kernel cache is a pre-compiled binary artifact that gets loaded and executed on the GPU without re-verification. If the cache is shared across nodes — a common optimization — a poisoned or stale cache entry can affect every node in the cluster.
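The defensive version of the cache key is straightforward to express. A hypothetical sketch, not Triton's actual implementation: canonical JSON removes the nondeterministic str(options) serialization problem, and folding driver and compiler versions into the key forces invalidation on fleet changes.

```python
import hashlib
import json

def kernel_cache_key(source, backend, driver_version, compiler_version, options):
    payload = json.dumps({
        "source": source,
        "backend": backend,
        "driver": driver_version,       # invalidate on driver upgrade
        "compiler": compiler_version,   # invalidate on compiler upgrade
        "options": options,             # sort_keys makes dict order irrelevant
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = kernel_cache_key("src", "cuda", "550.54", "12.4", {"num_warps": 4, "stages": 2})
k2 = kernel_cache_key("src", "cuda", "550.54", "12.4", {"stages": 2, "num_warps": 4})
k3 = kernel_cache_key("src", "cuda", "555.42", "12.4", {"num_warps": 4, "stages": 2})

assert k1 == k2   # option ordering does not change the key
assert k1 != k3   # driver change forces recompilation
```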
10. CUDA Graph Address-Capture Corruption
CUDA graphs replay operations using the exact memory addresses captured during recording. If tensors are deallocated before replay, the graph accesses freed memory, causing corruption. Frameworks introduce graph-specific allocator strategies (private pools, checkpointing) that can still be misused.
Arbitrary, non-local corruption. You can flip a single layer output and get plausible but incorrect completions. The model does not know it is wrong. The serving framework does not know it is wrong. The user does not know it is wrong.
11. MIG "Undefined Device" Placement Heterogeneity
In MIG "mixed strategy," NVIDIA's Kubernetes guidance warns that if a container requests more than one device type (e.g., nvidia.com/gpu plus a MIG resource), "the device received is undefined" in default setups. Different SM counts, different memory partitions → different kernel choices and different numeric behavior.
Teams assume resource requests deterministically map to hardware. The undefined mapping sits in orchestration docs, not ML postmortems. I have seen output drift that perfectly correlated with placement — not model version, not code changes, not data. Placement.
How to detect it: Deploy two pods with mixed resource requests. Log actual device UUID and MIG profile. Compare output hashes for the same prompt. Add admission control policies rejecting multi-type GPU requests unless explicitly allowed.
12. NCCL Topology/Algorithm Changes and FP Non-Associativity
NCCL selects algorithms — Ring, Tree, and others — based on topology and configuration. Floating-point addition is non-associative. Reduction order changes alter outputs. NCCL has extensive environment variables controlling algorithm, protocol, and topology. Most teams never pin them.
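You can see reduction-order sensitivity without any GPU. A toy sketch comparing a sequential (ring-like) accumulation order against a pairwise (tree-like) order over the same operands:

```python
def ring_reduce(vals):
    # sequential left-to-right accumulation, like a ring all-reduce order
    total = 0.0
    for v in vals:
        total += v
    return total

def tree_reduce(vals):
    # pairwise reduction, like a tree all-reduce order
    # (assumes power-of-two length, enough for the sketch)
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

grads = [1e16, 1.0, -1e16, 1.0]                  # stand-in gradient shards
print(ring_reduce(grads), tree_reduce(grads))    # 1.0 0.0
assert ring_reduce(grads) != tree_reduce(grads)
```

Same shards, same values, different collective shape, different gradient. Scale that across billions of parameters and thousands of steps and the runs diverge.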
Teams attribute training divergence to optimizer randomness rather than collective order and topology drift. They rarely consider that NIC selection and ring construction can change between runs, between restarts, sometimes between steps if the fabric is congested.
Mitigation: Pin NCCL algorithm/protocol/topology. Record NCCL config and topology fingerprint per run. Add topology-aware determinism tiering — repro tiers that require stable rings/trees and deterministic reductions.
The Deep Security Perspective
I want to go deeper into the security surface, because the failure modes above are not just reliability concerns. They intersect with hardware security in ways that should concern anyone running AI systems that make consequential decisions.
Consider the x86 equivalent for context. The GPU uses different instructions but the principle is identical:
This is why I advocate for cryptographic weight verification at runtime — not just at load time. Hash the weight tensors periodically and compare against known-good values. It is expensive. It is the only reliable defense against silent weight corruption, whether from hardware faults, row-hammer, or supply-chain tampering.
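A minimal sketch of that verification loop, in pure Python over raw tensor bytes; a real system would hash device memory in chunks, off the critical path, and amortize the cost across requests:

```python
import hashlib

def weight_digest(named_tensors):
    """Digest over raw weight bytes, iterated in sorted name order."""
    h = hashlib.sha256()
    for name in sorted(named_tensors):
        h.update(name.encode())
        h.update(named_tensors[name])
    return h.hexdigest()

weights = {
    "layer0.attn.wq": b"\x00\x3e" * 1024,   # stand-in weight bytes
    "layer0.mlp.w1": b"\x01\x40" * 1024,
}
known_good = weight_digest(weights)          # recorded at load time

# later, at runtime: a single flipped bit in one tensor is caught
weights["layer0.attn.wq"] = b"\x00\x3f" + b"\x00\x3e" * 1023
assert weight_digest(weights) != known_good
```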
The Zero Trust argument for GPU compute
In my Zero Trust work, the core principle is: never trust, always verify. This applies to network boundaries, identity, and data. I argue it must also apply to compute:
- Never trust that the GPU computed correctly — verify with canary workloads and cross-node comparison
- Never trust that the cached binary is valid — verify with environment fingerprints and signatures
- Never trust that the memory is uncorrupted — verify with periodic weight hashing
- Never trust that the driver is benign — verify with deterministic golden-set regression after every upgrade
- Never trust that the execution plan is stable — verify with plan-hash tracking and drift alerting
This is not paranoia. This is the engineering discipline required when a silent one-bit error in a GPU register can change a medical diagnosis, a financial recommendation, or a safety-critical decision.
Comparative Triage Matrix
| Failure Mode | Root Cause Layer | Detectability | Repro Difficulty | Impact Severity | Mitigation Cost |
|---|---|---|---|---|---|
| Silent Data Corruption (SDC) | Hardware / Firmware | Low | Hard | Critical | High |
| Memory Recovery Side Effects | Firmware / Driver / Runtime | Medium | Medium | High | Medium |
| Driver Kernel-Launch Corruption | Driver | Medium | Hard | Critical | Medium |
| Async Error Laundering | Driver / Runtime | Medium | Medium | High | Medium |
| cuBLAS Multi-Stream Workspace | Runtime / Kernel Library | Low | Medium | High | Medium |
| cuDNN Atomic Nondeterminism | Kernel Library / Hardware | Low | Medium | High | Medium |
| Allocator → Plan Drift | Runtime / Kernel Library | Low | Hard | High | Medium |
| PTX JIT Cache Invalidation | Driver / Compiler / Runtime | Low | Medium | High | Medium |
| Triton Cache-Key Gaps | Compiler / Runtime | Low | Hard | Critical | Medium–High |
| CUDA Graph Address Reuse | Runtime / Model | Low | Medium | Critical | Medium |
| MIG Undefined Device | Orchestration | Medium | Easy | High | Low–Medium |
| NCCL Topology Variability | Orchestration / Kernel Library | Medium | Medium | High | Medium |
Priority ordering: SDC > Workspace/Allocator Drift > Compiler/Cache Drift. These three represent the highest impact with lowest detectability for most organizations.
The top three
These three failure modes should be the immediate priority for any organization running AI inference at scale.
SDC — silent wrong math at the silicon level. No error signal. No log entry. No alert. The model confidently produces incorrect output.
Workspace / Allocator-Driven Drift — cuBLAS multi-stream workspace selection plus fragmentation-driven algorithm changes create systematic, placement-dependent semantic divergence with no faults.
Compiler / Cache Drift — PTX JIT invalidation plus Triton cache-key gaps produce different kernels after fleet changes, and everyone misattributes it to "model randomness."
If you only have budget for three workstreams, start there.
Engineering Roadmap
Month 0–2: Establish Correctness Observability Baselines
Owners: GPU Platform/SRE, ML Serving Runtime, Observability
| Deliverable | Owner | Success Metric |
|---|---|---|
| GPU health ingestion (Xids, ECC, page retirement, row-remap, DCGM) | SRE | 100% of inference GPUs reporting health signals |
| Determinism tier config per service (stream policy, workspace config, cuDNN flags) | ML Serving | All Tier-1 models have determinism spec |
| GPU identity in request telemetry (UUID, MIG profile, driver version) | Observability | > 95% of requests have GPU identity attached |
Month 2–4: Golden Canaries and Drift Attribution
Owners: ML Quality, Serving Runtime, SRE
Deploy a golden prompt suite at temperature=0 with stable output signatures — token IDs plus logits checksum. Run shadow execution for a small traffic slice. Compare outputs and quarantine nodes with repeated divergence.
Target: drift rate per 1M requests baselined, then 10× reduction.
Month 4–6: Deterministic Execution Capsules
Owners: Serving Runtime, Observability
The execution capsule is the core new primitive. It makes "what ran" reconstructible. Everything a post-mortem needs to explain why two identical requests produced different outputs, captured at request time, not after the incident.
Target: greater than 90% of drift incidents have complete capsule reconstruction.
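Concretely, a capsule can be a small frozen record with a content-addressed fingerprint. A hypothetical sketch whose fields mirror the signals discussed in this article; it is not an existing library:

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class ExecutionCapsule:
    gpu_uuid: str
    mig_profile: str
    driver_version: str
    cuda_version: str
    plan_hashes: tuple   # cuDNN/cuBLAS/Triton/NCCL plan hashes

    def fingerprint(self):
        # content-addressed: canonical JSON, then hash
        payload = json.dumps(asdict(self), sort_keys=True)
        return "sha256:" + hashlib.sha256(payload.encode()).hexdigest()

a = ExecutionCapsule("GPU-aaaa", "none", "550.54", "12.4", ("cudnn:h1", "nccl:h7"))
b = ExecutionCapsule("GPU-aaaa", "none", "555.42", "12.4", ("cudnn:h1", "nccl:h7"))

# identical requests, different driver -> different capsule, attributable drift
assert a.fingerprint() != b.fingerprint()
```

Attach the fingerprint to every request at serving time; reconstructing "what ran" then becomes a lookup, not an archaeology project.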
Month 6–12: Governance Gates for Fleet Changes
Owners: Platform Engineering, Security/Supply Chain, Release Engineering
| Gate | Trigger | Required Action | Blocking |
|---|---|---|---|
| Driver Upgrade | New driver version on any node | Canary on golden suite; compare capsule diffs | Yes |
| Triton Cache Shared | Cache volume mounted cross-node | Verify env fingerprint match; sign artifacts | Yes |
| MIG Mixed Request | Pod requests multiple GPU types | Reject unless explicitly allowed by policy | Yes |
| NCCL Config Change | Topology or algorithm env var change | Run all-reduce microbench; compare to baseline | Yes |
| Container Image Update | New image with different CUDA/cuDNN | Full golden-suite regression | Yes |
Target: zero post-upgrade drift incidents for deterministic-tier models.
Observability Primitives
Integration with OpenTelemetry GenAI
OpenTelemetry GenAI semantic conventions define spans and attributes for inference requests. The right approach is to extend this with capsule references rather than stuffing all hardware details into span attributes. Keep the spans lightweight. Let the capsule store carry the forensic depth.
Proposed signals:
- gen_ai.execution.capsule_ref → content-addressed pointer (hash/URI) to the capsule
- gen_ai.execution.plan_hashes → small set of plan hashes (cuDNN/cuBLAS/Triton/NCCL)
- system.gpu.uuid, system.gpu.mig_profile, system.gpu.driver_version → low-cardinality routing and debug fields
Supply Chain: SBOM and AIBOM
The same capsule principle extends to AI supply chain governance. Connect runtime dependencies (SBOM) and AI artifacts (AIBOM) to execution capsules. This creates a complete chain: model provenance → runtime environment → execution plan → output verification.
Relevant standards: NTIA SBOM minimum elements for baseline disclosure. SPDX 3.0.1 with AI and Dataset profiles. AIBOM as a first-class supply-chain element via SPDX extension. IETF draft EAT profile for AI agents referencing SBOM/AIBOM via attestation claims.
Cross-Vendor Determinism
One more thing that catches teams when they try to run the same model across AMD and NVIDIA GPUs.
AMD's hipBLAS documents that some functions may use atomic ops to increase performance, causing results to not be bit-wise reproducible. The backend defaults differ: rocBLAS may allow atomics by default while cuBLAS disallows them. This means the same PyTorch code running on AMD vs NVIDIA GPUs may have different determinism defaults.
| Property | NVIDIA (cuBLAS/cuDNN) | AMD (rocBLAS/MIOpen) |
|---|---|---|
| Atomics default | Disabled (deterministic) | Enabled (nondeterministic) |
| Same-arch bitwise repro | Guaranteed (with conditions) | Guaranteed (with conditions) |
| Cross-arch bitwise repro | NOT guaranteed | NOT guaranteed |
| Determinism env var | CUBLAS_WORKSPACE_CONFIG | ROCBLAS_LAYER (partial) |
| Multi-stream guarantee | NOT guaranteed | NOT guaranteed |
If you are running a multi-vendor fleet, you need vendor-aware determinism policies. The execution capsule should capture which vendor and backend are active, and your drift alerting should account for the different defaults.
Closing
I started this article with a piece of assembly code and told you the code was clean.
The issue was never the code. It was the substrate. The silicon. The physics. The contracts between layers that nobody wrote down and nobody monitors.
That is the real lesson here. AI systems are the most complex software-hardware integration problem we have built at scale, and we are running them on infrastructure assumptions inherited from an era when "the hardware works" was a reasonable default.
It is not anymore.
If you take one thing from this article, let it be this: correctness in AI is not a model property. It is an infrastructure property. And infrastructure correctness only exists if you build it, verify it, and monitor it across every layer of the stack.
That starts with the assembly. It ends with the governance gates. And everything in between is where the silent collapses happen.
Related Reading
This article is part of a series on deep systems architecture for AI:
- Kernel Dynamics: The Real Bottleneck of AI — prefill vs decode, memory walls, and GPU pipeline design
- When Your LLM Trips the MMU — page faults, TLB shootdowns, and the virtual memory tax of AI inference
- AI as a Worker, Not an Engineer — the hidden ceilings of AI coding agents
- QSAF: Qorvex Security AI Framework — 63 controls across 9 domains for AI security