The Silent Collapse
Deep-Stack Hardware–Software Failure Modes That Corrupt AI Systems Without a Trace
There is a class of failure in AI systems that does not announce itself.
No crash. No Xid. No stack trace. No alert. The system continues to serve. The metrics stay green. The model generates fluent, confident, plausible text. And the outputs are wrong.
Not wrong in the way a hallucination is wrong — obviously, detectably, sometimes amusingly wrong. Wrong in the way a corrupted gradient is wrong: silently, systematically, in a direction you cannot distinguish from legitimate nondeterminism until you have already shipped it to production and the damage compounds across thousands of requests.
I have spent over twenty years building and debugging complex systems including distributed infrastructure. I have traced failures from x86 microcode errata through kernel page-table corruption to CUDA driver bugs that only manifest under multi-process GPU sharing. My published work on Microsoft Tech Community — The Hidden Memory Architecture of LLMs, AI Didn't Break Your Production — Your Architecture Did, and my Zero Trust architectural guidance — keeps returning to the same thesis: when AI fails in production, it is rarely because the model is weak. It is because the infrastructure contract was never specified, never verified, and never monitored.
My research and analysis on deep systems architecture, AI and deep learning, GPU memory hierarchies, and AI infrastructure failure modes have been referenced by engineers and the broader systems community. Not because these topics are novel in isolation, but because the complete failure surface across hardware, firmware, driver, runtime, compiler, and orchestration rarely gets assembled in one coherent framework. Most organizations only assemble it after the incident.
Now. Before we go any further, I want to show you something.
Now, some of you are probably looking at that and wondering what on earth it is. That is fair. Engineers of my generation spent a lot of time with code like this. It is called assembly. Specifically, this is NVIDIA Hopper SASS, the actual machine instructions your GPU executes when a transformer layer performs a matrix multiply.
And if you can read it, and you looked at it carefully, and you did not spot the problem — let me put this very clearly.
The code is clean.
There is no bug. No typo. No off-by-one. No misaligned pointer. The instruction is correct. The register encoding is correct. The operation is correct.
The issue lives below anything the source code can express.
A single bit-flip — not in the code, not in the binary, but in the physical register holding the value at runtime — turned 0.125 into 8192.0. And from that point forward, every downstream computation is contaminated. The attention distribution shifts. The argmax changes. The model produces a different token. And nothing in your monitoring, your logging, your alerting, or your metrics will tell you it happened.
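You can reproduce the arithmetic of that flip on a CPU. In IEEE half precision, 0.125 and 8192.0 differ by exactly one exponent bit, which is the reading this sketch assumes. It uses Python's struct half-float codec, purely for illustration, with no GPU involved:

```python
import struct

def f16_to_bits(x: float) -> int:
    """Bit pattern of x as an IEEE 754 half-precision value."""
    return struct.unpack('<H', struct.pack('<e', x))[0]

def bits_to_f16(b: int) -> float:
    """Half-precision value for a 16-bit pattern."""
    return struct.unpack('<e', struct.pack('<H', b))[0]

original = 0.125                                              # 0x3000 in FP16
corrupted = bits_to_f16(f16_to_bits(original) ^ (1 << 14))    # flip one exponent bit
print(f"{original} -> {corrupted}")                           # 0.125 -> 8192.0
```

One bit, five orders of magnitude. No exception, no error code, just a different number.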
That is what Silent Data Corruption looks like from the inside. And that is what this article is about.
The goal of this article is precise: give platform teams a complete, prioritized catalogue of failure modes that can corrupt AI model outputs without raising alerts, along with concrete detection recipes and architectural mitigations.
This is not a debug guide. It is a governance framework for AI infrastructure correctness — built from the same Zero Trust principles I have advocated across my Microsoft publications: never assume correctness; always verify; instrument every trust boundary.
Why AI systems fail differently
Traditional distributed systems fail loudly. A database returns an error. A service returns a 500. A network partition triggers a timeout. The failure surface is well-understood, and decades of engineering have produced mature detection and recovery mechanisms.
AI systems fail silently. And they fail silently because correctness is an end-to-end property across hardware reliability, drivers, kernel libraries, allocators, compilers, and orchestration — and no single layer owns it.
Two trends amplify this:
Silent Data Corruption (SDC) — hardware faults that evade detection yet alter numerical results. The Open Compute Project SDC whitepaper explicitly calls out the "needle in a haystack" detection challenge and the gap between low-level fault metrics and AI correctness metrics at scale.
Performance-driven dynamism — autotuners, algorithm heuristics, caching allocators, and compilation caches that legitimately change execution plans over time. These are features, not bugs. But they create a system where the same model, same input, same hardware can produce different outputs depending on timing, memory pressure, concurrency, and cached state.
That diagram is the failure surface this article maps. Every arrow is a real failure mode I have either seen, debugged, or found documented in primary vendor sources. Twelve of them, across five layers. Let me walk you through each one.
"But we didn't change the model." You are right. You changed the execution contract, and nobody wrote it down.
The Twelve Failure Modes
1. Silent Data Corruption That Evades Detection
I opened with this one for a reason. It is the most dangerous because it is invisible.
SDC occurs when hardware faults produce incorrect computational results without triggering hardware error detection. The Open Compute Project whitepaper highlights that AI workloads can mask these faults, making detection difficult at fleet scale, and emphasizes the mismatch between hardware metrics (FIT rates, Architectural Vulnerability Factor) and AI correctness metrics (accuracy, loss, token-level agreement).
Most organizations treat "GPU error" as synonymous with explicit failures: Xid events, ECC errors, crashes. SDC breaks that assumption. It can look like benign nondeterminism unless you run canary invariants or cross-checks.
Here is how it propagates through a transformer forward pass. The critical part: a single corrupted multiply-accumulate in the attention score computation can shift the entire probability distribution.
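That shift is easy to demonstrate in miniature. A toy sketch with stand-in score values, not a real attention kernel: corrupt a single score and the softmax argmax moves, so the model emits a different token with no error raised anywhere.

```python
import math

def softmax(xs):
    m = max(xs)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

clean_scores = [0.125, 0.110, 0.100]       # toy attention scores
corrupt_scores = [0.125, 0.110, 8192.0]    # one multiply-accumulate corrupted

print(argmax(softmax(clean_scores)))       # 0, the intended token
print(argmax(softmax(corrupt_scores)))     # 2, a different token, silently
```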
This is not hypothetical. At fleet scale — thousands of GPUs running 24/7 — the statistical expectation of silent bit-flips is non-zero. Google's published research on silent data corruption documents that SDC occurs at measurable rates in production data centers, and that SDC events are not uniformly distributed. Some silicon lots and some operating conditions produce significantly higher rates.
How to detect it: Run a deterministic "golden micro-batch." Fix seeds, disable algorithm benchmarking, enforce deterministic cuDNN algorithms. Store checksums of intermediate tensors and logits per golden run. Alert on deviations outside a tiny tolerance when the execution capsule fingerprint matches. Add "shadow execution" for 0.1–1% of traffic: rerun on a different GPU/node and compare logits distance.
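For the strict bitwise-reproducibility tier, the comparison step can be as simple as hashing serialized logits from the golden run. A minimal sketch, hashing raw float bytes; in practice you would serialize the actual tensors and fall back to a tolerance comparison for non-bitwise tiers:

```python
import hashlib
import struct

def logits_checksum(logits):
    """Deterministic digest of a logits vector (bitwise-repro tier)."""
    payload = struct.pack(f'<{len(logits)}f', *logits)
    return hashlib.sha256(payload).hexdigest()

golden = logits_checksum([2.31, -0.07, 5.88, 1.02])   # recorded at golden-run time
replay = logits_checksum([2.31, -0.07, 5.88, 1.02])   # healthy re-execution
corrupt = logits_checksum([2.31, -0.07, 8192.0, 1.02])  # one corrupted value

assert golden == replay    # same bits, same digest
assert golden != corrupt   # any flipped bit changes the digest
```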
Symptoms: Temperature=0 inference occasionally yields different tokens on the same prompt. Training shows rare loss spikes that nobody can attribute to data or learning rate. Convergence to a different optimum across otherwise-identical runs.
2. Memory-Error Recovery Side Effects
This one is subtle. The failure itself gets handled. The side effects of the handling are what hurt you.
On uncorrectable contained ECC errors, the NVIDIA driver terminates the affected application, then dynamic page offlining marks the faulty pages unusable. Later, row remapping can remap the faulty row in hardware after a GPU reset, potentially reclaiming those offlined pages.
Teams look for "job failed." They rarely track the follow-on effects on memory shape. Page offlining and row-remap state can alter available memory, allocator behavior, or cause pending remediation that requires a reset — feeding into algorithm and plan selection drift.
Here is the chain that bites you: reduced workspace → cuDNN selects a different algorithm → that algorithm uses FP16 accumulation instead of FP32 → logits differ by enough to flip tokens on borderline cases. This is documented in cuDNN's notes on numerical accuracy varying by algorithm based on workspace availability.
How to detect it: Monitor PAGE_RETIREMENT and row-remap pending/failure via nvidia-smi or NVML. Correlate plan drift metrics (cuDNN/cuBLAS plan hashes) with page-retirement and row-remap counters. If you see plan changes on a GPU that recently had memory remediation, that is your signal.
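The correlation itself is mechanical once both signals land in telemetry. A hypothetical sketch; the sample shape here, (retired_pages, plan_hash) per GPU over time, is an assumption for illustration, not an NVML or DCGM API:

```python
def flag_remediation_drift(samples):
    """samples: gpu_uuid -> time-ordered [(retired_pages, plan_hash), ...].
    Flag GPUs whose execution plan changed right after memory remediation."""
    flagged = []
    for gpu, series in samples.items():
        for (pages0, plan0), (pages1, plan1) in zip(series, series[1:]):
            if pages1 > pages0 and plan1 != plan0:
                flagged.append(gpu)
                break
    return flagged

fleet = {
    "GPU-aaaa": [(0, "h1"), (0, "h1"), (0, "h1")],   # healthy, stable plan
    "GPU-bbbb": [(0, "h1"), (2, "h9"), (2, "h9")],   # retirement followed by plan drift
}
print(flag_remediation_drift(fleet))  # ['GPU-bbbb']
```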
3. Driver-Level Kernel-Launch Corruption Under Multi-Process Sharing
This is the one that makes me lose sleep.
NVIDIA Data Center GPU Driver release notes document a fixed issue: "potential corruption when launching kernels on H100 GPUs," more likely when the GPU is shared between multiple processes, manifesting as Xid 13 errors. When corruption is concurrency-sensitive, it gets misdiagnosed as model nondeterminism, user-kernel bugs, or framework issues.
In a multi-tenant GPU environment — MPS-enabled or time-sliced MIG — this means that one tenant's workload can corrupt another tenant's results. This is not a side-channel. It is a direct correctness violation. From a Zero Trust perspective, this demands process-level isolation for correctness-critical workloads.
How to detect it: Log driver version + GPU model + multiplexing mode as part of every request capsule. Alert on output drift correlated with concurrency. Reproduce on a pinned driver: run two processes saturating GEMM and attention kernels concurrently, compare outputs to single-process baseline, and watch for Xids.
4. Asynchronous Error Surfacing That Launders Root Cause
CUDA kernel launches are asynchronous. Errors are reported at later synchronization points — cudaMemcpy, cudaDeviceSynchronize, sometimes just a random API call that happens to sync. Even benign calls may return error codes from previous asynchronous launches, per CUDA runtime API documentation.
In inference pipelines with multiple CUDA streams, graph capture, and batched execution, the "sync gap" between a faulty kernel and the error report can span dozens of operations. If the pipeline consumes outputs before hitting a sync boundary, corrupted logits are shipped to the client.
The correct pattern: sync fences at correctness-critical boundaries.
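Stripped of CUDA specifics, the pattern is: collect deferred errors from asynchronous stages and refuse to release outputs until they have been checked. An illustrative sketch; the class and method names are hypothetical, not a CUDA API:

```python
class SyncFence:
    """Surface deferred errors before outputs cross a trust boundary."""
    def __init__(self):
        self._deferred = []

    def record(self, stage, error):
        # async stages report failures here instead of raising inline
        self._deferred.append((stage, error))

    def check(self):
        # called at the correctness-critical boundary, before shipping outputs
        if self._deferred:
            raise RuntimeError(f"deferred errors at fence: {self._deferred}")

fence = SyncFence()
fence.record("attention_kernel", ValueError("illegal address"))
try:
    fence.check()              # output is blocked, not silently shipped
except RuntimeError:
    print("output blocked at fence")
```

The CUDA equivalent is an explicit synchronize plus error check after the last kernel whose output you are about to ship, rather than hoping a later API call happens to surface the failure.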
5. cuBLAS Multi-Stream Workspace Nondeterminism
cuBLAS guarantees bitwise reproducibility under specific conditions, but explicitly warns that the guarantee does not hold when multiple CUDA streams are active. Nondeterminism arises from internal workspace selection optimizations.
This matters because the transformer's entire computation is a chain of matrix multiplications. If different workspace selections produce different floating-point rounding in the attention computation, the softmax distribution shifts, and the argmax can change. "Same input, same GPU, different output" — and nobody changed anything.
6. cuDNN Atomic-Based Nondeterminism and Cross-Architecture Drift
cuDNN states most routines are bitwise reproducible on the same architecture, but lists exceptions that are nondeterministic because they use atomic operations introducing "truly random floating point rounding errors." Across architectures, cuDNN routines do not guarantee bitwise reproducibility.
Let me show you why from first principles. This is the kind of thing that kept me up at night twenty years ago, and it is the same physics now, just at a different scale.
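Here is the property in its smallest form: floating-point addition is not associative, on any hardware, in a few lines of Python.

```python
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # cancellation happens first, then add 1.0 -> 1.0
right = a + (b + c)   # 1.0 is absorbed into -1e16 first          -> 0.0

print(left, right)    # 1.0 0.0
assert left != right  # same operands, different grouping, different result
```

Atomics decide grouping by whichever thread wins the race, so every run can take a different path through exactly this kind of rounding.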
This is a fundamental mathematical property, not a bug. Oak Ridge National Laboratory's SC24 work documents that deep learning sensitivity to floating-point non-associativity can be "extreme," impacting reproducibility and certification.
The practical consequence: cross-region failover is brittle. Same request, different GPU generation, different completion. Teams set seeds and think they are done. They are not. They need to track which exact cuDNN algorithms were selected and understand that "deterministic selection" is distinct from "deterministic algorithm."
7. Allocator Fragmentation Forcing Algorithm/Precision Drift
This one is insidious because it correlates with uptime and load patterns, not code changes. Same model, same input, same GPU — different output after hours of production traffic because memory fragmentation changed the algorithm selection.
Framework allocators fragment GPU memory into "slivers" as batch sizes fluctuate. PyTorch documents how this pattern can lead to unrecoverable fragmentation without mitigation (e.g., expandable segments). Algorithm selection in cuDNN depends on available workspace — and cuDNN explicitly notes that numerical accuracy varies by algorithm based on whether extra workspace enables FP32 accumulation vs FP16.
How to detect it: Create memory pressure by allocating large tensors. Run a conv/attention op. Release pressure and repeat. Log selected algorithms and workspace sizes and compare. If they differ, your memory state is influencing your math.
8. PTX JIT and Compute-Cache Invalidation
CUDA fat binaries may include PTX — NVIDIA's intermediate representation. If the binary for the current GPU architecture is not present, the driver JIT-compiles PTX into SASS (the actual machine code). The driver caches the result. And here is the part everyone misses: the compute cache is automatically invalidated when the driver is upgraded.
Teams freeze container images but allow host driver updates, assuming "container immutability implies execution immutability." PTX JIT breaks that assumption completely. The container is immutable. The driver is not. The generated machine code changes.
Symptoms: "same container, slightly different outputs" after a driver upgrade. Cold-start regressions when JIT cache is cold. Mitigation: ship SASS for target architectures to reduce PTX reliance. Pin driver versions for determinism tiers. Treat driver upgrades like model upgrades — canary with deterministic golden sets.
9. Triton Kernel Cache Key Gaps
Triton's cache key derives from installation hash, source hash, backend hash, options hash, and selected environment variables. A recent Intel XPU backend issue warns that cache keys may miss backend/target invalidation factors (driver, compiler, environment), leading to incorrect cache reuse, nondeterministic behavior, and subtle correctness bugs. It also flags nondeterministic str(options) serialization risks.
From a supply-chain security standpoint, this is a real attack surface. Triton's kernel cache is a pre-compiled binary artifact that gets loaded and executed on the GPU without re-verification. If the cache is shared across nodes — a common optimization — a poisoned or stale cache entry can affect every node in the cluster.
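The defensive version of the cache key is straightforward to express. A hypothetical sketch, not Triton's actual implementation: canonical JSON removes the nondeterministic str(options) serialization problem, and folding driver and compiler versions into the key forces invalidation on fleet changes.

```python
import hashlib
import json

def kernel_cache_key(source, backend, driver_version, compiler_version, options):
    payload = json.dumps({
        "source": source,
        "backend": backend,
        "driver": driver_version,       # invalidate on driver upgrade
        "compiler": compiler_version,   # invalidate on compiler upgrade
        "options": options,             # sort_keys makes dict order irrelevant
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = kernel_cache_key("src", "cuda", "550.54", "12.4", {"num_warps": 4, "stages": 2})
k2 = kernel_cache_key("src", "cuda", "550.54", "12.4", {"stages": 2, "num_warps": 4})
k3 = kernel_cache_key("src", "cuda", "555.42", "12.4", {"num_warps": 4, "stages": 2})

assert k1 == k2   # option ordering does not change the key
assert k1 != k3   # driver change forces recompilation
```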
10. CUDA Graph Address-Capture Corruption
CUDA graphs replay operations using the exact memory addresses captured during recording. If tensors are deallocated before replay, the graph accesses freed memory, causing corruption. Frameworks introduce graph-specific allocator strategies (private pools, checkpointing) that can still be misused.
Arbitrary, non-local corruption. You can flip a single layer output and get plausible but incorrect completions. The model does not know it is wrong. The serving framework does not know it is wrong. The user does not know it is wrong.
11. MIG "Undefined Device" Placement Heterogeneity
In MIG "mixed strategy," NVIDIA's Kubernetes guidance warns that if a container requests more than one device type (e.g., nvidia.com/gpu plus a MIG resource), "the device received is undefined" in default setups. Different SM counts, different memory partitions → different kernel choices and different numeric behavior.
Teams assume resource requests deterministically map to hardware. The undefined mapping sits in orchestration docs, not ML postmortems. I have seen output drift that perfectly correlated with placement — not model version, not code changes, not data. Placement.
How to detect it: Deploy two pods with mixed resource requests. Log actual device UUID and MIG profile. Compare output hashes for the same prompt. Add admission control policies rejecting multi-type GPU requests unless explicitly allowed.
12. NCCL Topology/Algorithm Changes and FP Non-Associativity
NCCL selects algorithms — Ring, Tree, and others — based on topology and configuration. Floating-point addition is non-associative. Reduction order changes alter outputs. NCCL has extensive environment variables controlling algorithm, protocol, and topology. Most teams never pin them.
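You can see reduction-order sensitivity without any GPU. A toy sketch comparing a sequential (ring-like) accumulation order against a pairwise (tree-like) order over the same operands:

```python
def ring_reduce(vals):
    # sequential left-to-right accumulation, like a ring all-reduce order
    total = 0.0
    for v in vals:
        total += v
    return total

def tree_reduce(vals):
    # pairwise reduction, like a tree all-reduce order
    # (assumes power-of-two length, enough for the sketch)
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

grads = [1e16, 1.0, -1e16, 1.0]                  # stand-in gradient shards
print(ring_reduce(grads), tree_reduce(grads))    # 1.0 0.0
assert ring_reduce(grads) != tree_reduce(grads)
```

Same shards, same values, different collective shape, different gradient. Scale that across billions of parameters and thousands of steps and the runs diverge.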
Teams attribute training divergence to optimizer randomness rather than collective order and topology drift. They rarely consider that NIC selection and ring construction can change between runs, between restarts, sometimes between steps if the fabric is congested.
Mitigation: Pin NCCL algorithm/protocol/topology. Record NCCL config and topology fingerprint per run. Add topology-aware determinism tiering — repro tiers that require stable rings/trees and deterministic reductions.
The Deep Security Perspective
I want to go deeper into the security surface, because the failure modes above are not just reliability concerns. They intersect with hardware security in ways that should concern anyone running AI systems that make consequential decisions.
Consider the x86 equivalent for context. The GPU uses different instructions but the principle is identical:
This is why I advocate for cryptographic weight verification at runtime — not just at load time. Hash the weight tensors periodically and compare against known-good values. It is expensive. It is the only reliable defense against silent weight corruption, whether from hardware faults, row-hammer, or supply-chain tampering.
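A minimal sketch of that verification loop, in pure Python over raw tensor bytes; a real system would hash device memory in chunks, off the critical path, and amortize the cost across requests:

```python
import hashlib

def weight_digest(named_tensors):
    """Digest over raw weight bytes, iterated in sorted name order."""
    h = hashlib.sha256()
    for name in sorted(named_tensors):
        h.update(name.encode())
        h.update(named_tensors[name])
    return h.hexdigest()

weights = {
    "layer0.attn.wq": b"\x00\x3e" * 1024,   # stand-in weight bytes
    "layer0.mlp.w1": b"\x01\x40" * 1024,
}
known_good = weight_digest(weights)          # recorded at load time

# later, at runtime: a single flipped bit in one tensor is caught
weights["layer0.attn.wq"] = b"\x00\x3f" + b"\x00\x3e" * 1023
assert weight_digest(weights) != known_good
```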
The Zero Trust argument for GPU compute
In my Zero Trust work, the core principle is: never trust, always verify. This applies to network boundaries, identity, and data. I argue it must also apply to compute:
- Never trust that the GPU computed correctly — verify with canary workloads and cross-node comparison
- Never trust that the cached binary is valid — verify with environment fingerprints and signatures
- Never trust that the memory is uncorrupted — verify with periodic weight hashing
- Never trust that the driver is benign — verify with deterministic golden-set regression after every upgrade
- Never trust that the execution plan is stable — verify with plan-hash tracking and drift alerting
This is not paranoia. This is the engineering discipline required when a silent one-bit error in a GPU register can change a medical diagnosis, a financial recommendation, or a safety-critical decision.
Comparative Triage Matrix
| Failure Mode | Root Cause Layer | Detectability | Repro Difficulty | Impact Severity | Mitigation Cost |
|---|---|---|---|---|---|
| Silent Data Corruption (SDC) | Hardware / Firmware | Low | Hard | Critical | High |
| Memory Recovery Side Effects | Firmware / Driver / Runtime | Medium | Medium | High | Medium |
| Driver Kernel-Launch Corruption | Driver | Medium | Hard | Critical | Medium |
| Async Error Laundering | Driver / Runtime | Medium | Medium | High | Medium |
| cuBLAS Multi-Stream Workspace | Runtime / Kernel Library | Low | Medium | High | Medium |
| cuDNN Atomic Nondeterminism | Kernel Library / Hardware | Low | Medium | High | Medium |
| Allocator → Plan Drift | Runtime / Kernel Library | Low | Hard | High | Medium |
| PTX JIT Cache Invalidation | Driver / Compiler / Runtime | Low | Medium | High | Medium |
| Triton Cache-Key Gaps | Compiler / Runtime | Low | Hard | Critical | Medium–High |
| CUDA Graph Address Reuse | Runtime / Model | Low | Medium | Critical | Medium |
| MIG Undefined Device | Orchestration | Medium | Easy | High | Low–Medium |
| NCCL Topology Variability | Orchestration / Kernel Library | Medium | Medium | High | Medium |
Priority ordering: SDC > Workspace/Allocator Drift > Compiler/Cache Drift. These three represent the highest impact with lowest detectability for most organizations.
The top three
These three failure modes should be the immediate priority for any organization running AI inference at scale.
SDC — silent wrong math at the silicon level. No error signal. No log entry. No alert. The model confidently produces incorrect output.
Workspace / Allocator-Driven Drift — cuBLAS multi-stream workspace selection plus fragmentation-driven algorithm changes create systematic, placement-dependent semantic divergence with no faults.
Compiler / Cache Drift — PTX JIT invalidation plus Triton cache-key gaps produce different kernels after fleet changes, and everyone misattributes it to "model randomness."
If you only have budget for three workstreams, start there.
Engineering Roadmap
Month 0–2: Establish Correctness Observability Baselines
Owners: GPU Platform/SRE, ML Serving Runtime, Observability
| Deliverable | Owner | Success Metric |
|---|---|---|
| GPU health ingestion (Xids, ECC, page retirement, row-remap, DCGM) | SRE | 100% of inference GPUs reporting health signals |
| Determinism tier config per service (stream policy, workspace config, cuDNN flags) | ML Serving | All Tier-1 models have determinism spec |
| GPU identity in request telemetry (UUID, MIG profile, driver version) | Observability | > 95% of requests have GPU identity attached |
Month 2–4: Golden Canaries and Drift Attribution
Owners: ML Quality, Serving Runtime, SRE
Deploy a golden prompt suite at temperature=0 with stable output signatures — token IDs plus logits checksum. Run shadow execution for a small traffic slice. Compare outputs and quarantine nodes with repeated divergence.
Target: drift rate per 1M requests baselined, then 10× reduction.
Month 4–6: Deterministic Execution Capsules
Owners: Serving Runtime, Observability
The execution capsule is the core new primitive. It makes "what ran" reconstructible. Everything a post-mortem needs to explain why two identical requests produced different outputs, captured at request time, not after the incident.
Target: greater than 90% of drift incidents have complete capsule reconstruction.
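Concretely, a capsule can be a small frozen record with a content-addressed fingerprint. A hypothetical sketch whose fields mirror the signals discussed in this article; it is not an existing library:

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class ExecutionCapsule:
    gpu_uuid: str
    mig_profile: str
    driver_version: str
    cuda_version: str
    plan_hashes: tuple   # cuDNN/cuBLAS/Triton/NCCL plan hashes

    def fingerprint(self):
        # content-addressed: canonical JSON, then hash
        payload = json.dumps(asdict(self), sort_keys=True)
        return "sha256:" + hashlib.sha256(payload.encode()).hexdigest()

a = ExecutionCapsule("GPU-aaaa", "none", "550.54", "12.4", ("cudnn:h1", "nccl:h7"))
b = ExecutionCapsule("GPU-aaaa", "none", "555.42", "12.4", ("cudnn:h1", "nccl:h7"))

# identical requests, different driver -> different capsule, attributable drift
assert a.fingerprint() != b.fingerprint()
```

Attach the fingerprint to every request at serving time; reconstructing "what ran" then becomes a lookup, not an archaeology project.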
Month 6–12: Governance Gates for Fleet Changes
Owners: Platform Engineering, Security/Supply Chain, Release Engineering
| Gate | Trigger | Required Action | Blocking |
|---|---|---|---|
| Driver Upgrade | New driver version on any node | Canary on golden suite; compare capsule diffs | Yes |
| Triton Cache Shared | Cache volume mounted cross-node | Verify env fingerprint match; sign artifacts | Yes |
| MIG Mixed Request | Pod requests multiple GPU types | Reject unless explicitly allowed by policy | Yes |
| NCCL Config Change | Topology or algorithm env var change | Run all-reduce microbench; compare to baseline | Yes |
| Container Image Update | New image with different CUDA/cuDNN | Full golden-suite regression | Yes |
Target: zero post-upgrade drift incidents for deterministic-tier models.
Observability Primitives
Integration with OpenTelemetry GenAI
OpenTelemetry GenAI semantic conventions define spans and attributes for inference requests. The right approach is to extend this with capsule references rather than stuffing all hardware details into span attributes. Keep the spans lightweight. Let the capsule store carry the forensic depth.
Proposed signals:
- gen_ai.execution.capsule_ref → content-addressed pointer (hash/URI) to the capsule
- gen_ai.execution.plan_hashes → small set of plan hashes (cuDNN/cuBLAS/Triton/NCCL)
- system.gpu.uuid, system.gpu.mig_profile, system.gpu.driver_version → low-cardinality routing and debug fields
Supply Chain: SBOM and AIBOM
The same capsule principle extends to AI supply chain governance. Connect runtime dependencies (SBOM) and AI artifacts (AIBOM) to execution capsules. This creates a complete chain: model provenance → runtime environment → execution plan → output verification.
Relevant standards: NTIA SBOM minimum elements for baseline disclosure. SPDX 3.0.1 with AI and Dataset profiles. AIBOM as a first-class supply-chain element via SPDX extension. IETF draft EAT profile for AI agents referencing SBOM/AIBOM via attestation claims.
Cross-Vendor Determinism
One more thing that catches teams when they try to run the same model across AMD and NVIDIA GPUs.
AMD's hipBLAS documents that some functions may use atomic ops to increase performance, causing results to not be bit-wise reproducible. The backend defaults differ: rocBLAS may allow atomics by default while cuBLAS disallows them. This means the same PyTorch code running on AMD vs NVIDIA GPUs may have different determinism defaults.
| Property | NVIDIA (cuBLAS/cuDNN) | AMD (rocBLAS/MIOpen) |
|---|---|---|
| Atomics default | Disabled (deterministic) | Enabled (nondeterministic) |
| Same-arch bitwise repro | Guaranteed (with conditions) | Guaranteed (with conditions) |
| Cross-arch bitwise repro | NOT guaranteed | NOT guaranteed |
| Determinism env var | CUBLAS_WORKSPACE_CONFIG | ROCBLAS_LAYER (partial) |
| Multi-stream guarantee | NOT guaranteed | NOT guaranteed |
If you are running a multi-vendor fleet, you need vendor-aware determinism policies. The execution capsule should capture which vendor and backend are active, and your drift alerting should account for the different defaults.
Closing
I started this article with a piece of assembly code and told you the code was clean.
The issue was never the code. It was the substrate. The silicon. The physics. The contracts between layers that nobody wrote down and nobody monitors.
That is the real lesson here. AI systems are the most complex software-hardware integration problem we have built at scale, and we are running them on infrastructure assumptions inherited from an era when "the hardware works" was a reasonable default.
It is not anymore.
If you take one thing from this article, let it be this: correctness in AI is not a model property. It is an infrastructure property. And infrastructure correctness only exists if you build it, verify it, and monitor it across every layer of the stack.
That starts with the assembly. It ends with the governance gates. And everything in between is where the silent collapses happen.
Related Reading
This article is part of a series on deep systems architecture for AI:
- Kernel Dynamics: The Real Bottleneck of AI — prefill vs decode, memory walls, and GPU pipeline design
- When Your LLM Trips the MMU — page faults, TLB shootdowns, and the virtual memory tax of AI inference
- AI as a Worker, Not an Engineer — the hidden ceilings of AI coding agents
- QSAF: Qorvex Security AI Framework — 63 controls across 9 domains for AI security