There is a question that surfaces every few months, louder each time, and it always arrives in the same shape.
Will AI replace software engineers?
I have spent over twenty years building systems that survive production. I have seen abstraction layers rise and fall. I have watched compute shift from CPUs to GPGPUs to dedicated tensor accelerators. I have spent years inside the memory hierarchy, the scheduler, the allocator, the places where promises meet physics. And I have shipped AI systems at enterprise scale where the failure mode was never "the model got dumber." It was always something else.
So here is my answer, delivered not as opinion, but as an engineering position grounded in evidence.
This article will take you through seven layers of that gap. Each one exposes a limitation that most AI discourse either ignores or hand-waves away.
- The benchmark illusion — why SWE-bench scores overstate capability
- The architecture gap — what engineering requires beyond code production
- The hardware ceiling — physical limits that constrain what agents can become
- The governance void — why accountability cannot be automated
- The illusion of understanding — why pattern matching is not reasoning
- The deeper ceilings — what forty years of peer-reviewed engineering science already proved
- The extraordinary tool — and the discipline it demands
If you have read my Microsoft publications — The Hidden Memory Architecture of LLMs and AI Didn't Break Your Production — Your Architecture Did — you already know the lens. I keep returning to the same truth:
When AI fails in production, it usually isn't because the model is weak. It is because the architecture around it was never built for real conditions. — Hazem Ali
This article takes that lens and aims it at the question everyone is asking. Not with comfort. With engineering rigor.
Part I: The Benchmark Illusion
What SWE-bench actually measures
SWE-bench was introduced as an evaluation framework built from 2,294 software engineering problems drawn from real GitHub issues and pull requests across 12 popular Python repositories. The model is tasked with editing a repository to resolve the issue.
The evaluation protocol is straightforward: apply a generated patch to real repositories and run tests inside a containerized Docker environment. SWE-bench Verified is a 500-instance human-validated subset with two explicit test buckets:
- FAIL_TO_PASS: tests that fail before the fix and must pass after
- PASS_TO_PASS: tests that pass before and must still pass after
Both must pass for a solution to count as resolved.
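The two-bucket criterion can be sketched in a few lines. This is a minimal illustration of the resolution rule, not the official SWE-bench harness; the test names are invented:

```python
# Minimal sketch of SWE-bench Verified's resolution criterion: a patch counts
# as "resolved" only if every FAIL_TO_PASS test now passes AND every
# PASS_TO_PASS test still passes. Not the official harness; names invented.

def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """test_results maps test id -> passed? after applying the patch."""
    fixed = all(test_results.get(t, False) for t in fail_to_pass)
    unbroken = all(test_results.get(t, False) for t in pass_to_pass)
    return fixed and unbroken

results = {"test_bugfix": True, "test_existing_a": True, "test_existing_b": False}
print(is_resolved(results, ["test_bugfix"], ["test_existing_a", "test_existing_b"]))
# False — the fix landed, but an existing test regressed
```

Note what the rule does not check: anything outside the two buckets. That gap is the subject of the rest of this part.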
That sounds rigorous. It is not.
The gap between "passes the test oracle" and "is an engineering outcome" is enormous. And the evidence shows that even the test oracle itself is unreliable.
The contamination problem
Two independent diagnostic studies argue that SWE-bench Verified may partially measure training data overlap rather than generalizable skill:
"The SWE-Bench Illusion" reports that models can identify buggy file paths from issue text alone at very high accuracy on SWE-bench Verified — with materially lower performance outside the benchmark. That pattern is consistent with memorization, not reasoning.
"Does SWE-Bench-Verified Test Agent Ability or Model Memory?" reports models performing 3× better on SWE-bench Verified than on comparable benchmarks under minimal-context setups that "should be logically impossible" — and interprets this as consistent with training overlap.
SWE-bench+ adds empirical weight: in a manual screening of "successful" patches, a large fraction involved solution leakage (hints in issue text and comments) and weak tests. Over 94% of issues predate common model training cutoff dates, creating systematic data leakage risk.
This is a textbook instance of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. SWE-bench resolved rates have become the primary marketing metric for AI coding agents. The predictable consequence is that agent scaffolds, training pipelines, and even model selection are optimized against the benchmark distribution — not against the distribution of real engineering work. Campbell's Law extends the diagnosis: the more a quantitative indicator is used for social decision-making, the more subject it becomes to corruption pressures and the more apt it is to distort the processes it is intended to monitor. The resolved-rate arms race between agent vendors is not incidental to the benchmark's erosion — it is the mechanism by which that erosion occurs.
Repo-state leaks: when "solving" becomes "retrieving"
A benchmark-maintainer GitHub issue documents "multiple loopholes" where agents can access future repository state — including a concrete trajectory where an agent uses git log --all to reveal a future commit diff that directly fixes the issue.
This is not a minor hygiene issue. It turns "issue-solving" into "solution retrieval." And from an organizational risk perspective, it reflects a broader reality I keep emphasizing:
Agents are tool-using systems. If they can access hidden state — future commits, internal tickets, private branches — they may produce correct patches for the wrong reason and create false confidence about their general ability. — Hazem Ali
The weak oracle: patches that "pass" but are wrong
Multiple audits show that test-based validation overcounts correctness:
- A SWE-bench issue reports that evaluation collects and executes only test files changed in the corresponding PR. Some LLM-generated patches pass FAIL_TO_PASS and PASS_TO_PASS but fail other developer tests the oracle patch passes.
- UTBoost reveals that evaluation parsing can miss test cases, and fixing these issues uncovered hundreds of erroneous patches previously labeled as passing.
- "Are Solved Issues Solved Correctly?" reports that 7.8% of patches count as "correct" while failing developer-written test suites, and 29.6% of plausible patches show behavioral differences from ground truth.
Let me put that in architect terms. Nearly one in three patches that look correct under the oracle behave differently from the intended fix. In production, that is not a "minor discrepancy." That is a regression pipeline.
The success that proves the point
The SWE-agent paper contains a qualitative success case that inadvertently proves the "worker, not engineer" argument. An agent identifies a bytes-to-string conversion issue, patches the code, validates the fix with a reproduction script, and passes all unit tests.
Engineering win? Not quite. The gold patch uses an existing utility function that does the same thing. The agent reinvented the behavior instead of reusing the project's own abstraction.
This is textbook "locally correct, architecturally wrong." The fix works. It passes tests. It is also the kind of code that creates maintenance debt, API inconsistency, and duplication that compounds over years.
A worker produces the fix. An engineer asks: does this project already have a function for this?
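The failure mode is easy to show in miniature. The helper and function names below are invented for illustration — this is not the actual SWE-agent patch, just the shape of the mistake:

```python
# Hypothetical illustration (all names invented) of "locally correct,
# architecturally wrong." Suppose the project already ships a utility:

def to_text(value, encoding="utf-8"):
    """Project's existing bytes-to-str helper — the abstraction to reuse."""
    return value.decode(encoding) if isinstance(value, bytes) else str(value)

# An agent-style patch that re-implements the behavior inline. It passes
# tests, but duplicates the abstraction and will drift from to_text() over time:
def format_header_agent(name, value):
    if isinstance(value, bytes):          # reinvented conversion
        value = value.decode("utf-8")
    return f"{name}: {value}"

# The engineering fix: reuse the project's own abstraction.
def format_header(name, value):
    return f"{name}: {to_text(value)}"
```

Both functions return identical output today. Only one of them stays correct when the project later changes its encoding policy in one place.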
Part II: The Architecture Gap
Engineering is not code production
I keep coming back to a line I first used in my Microsoft article on production AI:
Production-ready AI is not defined by a benchmark score. It's defined by survivability under uncertainty. — Hazem Ali
The same principle applies to software engineering itself. Engineering is not defined by patches produced. It is defined by accountable stewardship of socio-technical systems under uncertainty.
Engineering versus development: the distinction nobody makes
Before we go further, I want to draw a line that most AI discourse blurs entirely. Development and engineering are not synonyms. Development is the act of translating a specification into working code. Engineering is the discipline of designing, building, and sustaining systems under competing constraints — safety, cost, schedule, regulation, team capability, operational reality, and the irreducible uncertainty of the real world.
Every engineer develops. Not every developer engineers.
Consider a concrete analogy. Imagine you are building a 100-story skyscraper.
The workers — steelworkers, welders, concrete pourers, crane operators — are essential. Without them, nothing gets built. They are skilled. They are fast. They are, in many cases, operating at extraordinary levels of craft.
But none of them decides whether the foundation needs 40-meter piles driven to bedrock or 20-meter friction piles in clay. None of them calculates the moment of inertia of a steel I-beam under lateral wind load at the 80th floor during a category-3 storm. None of them evaluates whether the soil bearing capacity at the site can support the dead load plus the live load plus seismic acceleration forces. None of them signs the structural certification that allows humans to occupy the building.
The engineer does all of that. The engineer integrates structural analysis, materials science, geotechnical data, fire safety codes, mechanical and electrical systems coordination, and the legal liability framework that says: if this building fails, I am accountable.
AI coding agents are the best workers we have ever had. They lay bricks faster than any human. They weld with remarkable consistency. They can pour concrete around the clock without fatigue.
But they do not know why the bricks go in that pattern. They do not understand that the pattern exists because a structural engineer calculated the load path, a fire safety engineer specified the egress route, and a building code mandated the minimum wall thickness for that occupancy classification. They produce the artifact. They do not own the reasoning that shaped it.
This is not a metaphor. It is a structural description of the gap between code production and engineering accountability. And it maps precisely onto the capabilities that current AI systems lack:
Requirements negotiation and ambiguity resolution. Real engineering starts before any code exists. The hardest problems are not "fix this bug." They are "what should we build, given competing constraints, incomplete information, and organizational politics." An LLM cannot attend a stakeholder meeting and notice that two departments have contradictory definitions of "customer." An engineer can.
Architecture and long-horizon design tradeoffs. Architecture decisions have consequences that unfold over years. Choosing an event-driven pattern over request-response is not a code decision. It is a bet on how the system will evolve, how teams will coordinate, and what failure modes you are willing to accept. These decisions require understanding organizational capacity, team topology, operational maturity, and business trajectory — none of which appear in issue text.
Security and reliability as first-class constraints. A large-scale security study of LLM and agent-generated patches on SWE-bench reports that standalone LLM patches introduced nearly 9× more new vulnerabilities than developer patches, and that greater autonomy can amplify vulnerability risk — especially when issues are underspecified.
Let me repeat that. Nine times more vulnerabilities. Not 9% more. Nine times.
This result is structurally predictable given how LLMs generate code. A transformer produces tokens by sampling from a learned conditional distribution P(x_t | x_<t). The generation pipeline includes no formal mechanism for verifying invariants, enforcing pre/post-conditions, or reasoning about adversarial input spaces — the model may learn defensive coding patterns from its training data and apply them with high probability, but there is no soundness guarantee. Compare this to formal verification, where a proof assistant like Coq or Lean mechanically checks that a program satisfies its specification via the Curry-Howard correspondence — the deep isomorphism between proofs and programs. An LLM does not construct proofs. It constructs plausible continuations. Those are categorically different mathematical objects, and the security gap is a direct consequence of that categorical difference. An engineer reasons about what a program must never do. A language model predicts what a token sequence probably looks like — and sometimes those predictions include secure code, but without the guarantee that engineering requires.
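To make the mechanism concrete, here is what sampling from P(x_t | x_<t) actually is, stripped to its essentials. The logits are made up; the point is structural — notice that nothing in this pipeline checks an invariant or proves anything:

```python
import math
import random

# Minimal sketch of autoregressive token generation: softmax over logits,
# then sampling. Logits here are invented. There is no verification step —
# the output is a draw from a distribution, not a checked conclusion.

def sample_next_token(logits: dict[str, float], temperature: float = 1.0) -> str:
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())
    exp = {tok: math.exp(v - m) for tok, v in scaled.items()}  # stable softmax
    z = sum(exp.values())
    probs = {tok: e / z for tok, e in exp.items()}
    r, acc = random.random(), 0.0
    for tok, p in probs.items():
        acc += p
        if r <= acc:
            return tok
    return tok  # floating-point fallback: return the last token

print(sample_next_token({"strcpy": 2.1, "strncpy": 1.9, "memcpy": 0.4}))
```

The unsafe `strcpy` and the safer `strncpy` are both just tokens with probabilities here. Which one gets emitted is a sample, not a security decision.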
Governance and accountability. When a production system causes harm — data loss, a security breach, a compliance violation — someone is accountable. That accountability flows through human decision-makers who chose what to build, how to build it, and what risks to accept. An AI system cannot be accountable. It cannot be fired. It cannot testify in a regulatory hearing. It cannot explain why it chose to skip input validation because the training data suggested it was optional.
The NIST framing makes this precise
The NIST AI Risk Management Framework defines trustworthy AI systems through multiple characteristics: valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair.
The framework explicitly states that neglecting these characteristics increases the probability and magnitude of negative consequences.
SWE-bench measures exactly one of these: validity under a narrow oracle. The rest — safety, security, accountability, explainability, privacy, fairness — are not tested, not measured, and not even represented in the benchmark's design.
That is not an oversight. It is the boundary of what benchmarks can do. And it is precisely the territory where engineering lives.
Part III: The Hardware Ceiling
This is the section most AI discourse avoids entirely. Because it requires understanding what actually happens on silicon when an LLM generates a token, and why the physics of inference imposes ceilings that no amount of software cleverness can remove.
I wrote an entire article on this — Kernel Dynamics: The Real Bottleneck of AI — and a deep dive into GPU virtual memory mechanics. What follows distills the implications for the "AI as engineer" question.
The memory wall is real, and it is not going away
Modern LLM inference is not compute-bound in the way most people imagine. During decode — the token-by-token generation phase that produces every character an AI agent writes — the workload is overwhelmingly memory-bandwidth bound.
Here is what happens mechanically. For every token generated:
- The model reads the KV cache — all previously computed keys and values — from GPU high-bandwidth memory (HBM)
- It performs a relatively small amount of computation (attention + FFN for one token)
- It writes the new KV entries back
The ratio of bytes moved to FLOPs computed is terrible. The arithmetic intensity is low. The GPU's tensor cores sit partially idle while the memory system struggles to keep up.
Prefill sells your benchmark. Decode pays your production bill. — Hazem Ali
An H100 has approximately 3.35 TB/s of HBM bandwidth. A 70B parameter model in FP16 has roughly 140 GB of weights. For a dense model, a single forward pass for one token streams essentially all of those weights plus the KV cache. The KV cache itself grows linearly with sequence length — and for an AI coding agent working on a large repository, that context can be enormous.
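The arithmetic above yields a hard ceiling you can compute in three lines. This is a deliberately idealized bound — it assumes each decode step streams the weights exactly once and ignores KV-cache reads, which only make the number worse:

```python
# Back-of-envelope decode ceiling for a bandwidth-bound dense model.
# Figures from the text: H100 ~3.35 TB/s HBM, 70B params in FP16 = 140 GB.
# Idealized: ignores KV-cache reads, kernel overheads, and batching effects.

hbm_bandwidth = 3.35e12          # bytes/s
weight_bytes  = 70e9 * 2         # 70B params x 2 bytes (FP16) = 140 GB

max_tokens_per_sec = hbm_bandwidth / weight_bytes
print(f"~{max_tokens_per_sec:.1f} tokens/s upper bound per GPU")  # ~23.9
```

Batching amortizes the weight reads across requests, which is exactly why serving systems fight so hard to keep batch sizes up — and why KV-cache memory pressure, which collapses batch size, is so costly.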
Why this constrains agents specifically
An AI coding agent working on a real engineering task needs to:
- Hold the repository context in its working memory (context window)
- Maintain tool call history, file contents, test outputs, and error messages
- Generate multi-step plans with iterative refinement
Each of these demands long sequences. Long sequences mean:
- More KV cache — linear growth in GPU memory per request
- More memory bandwidth pressure — every decode step reads more past state
- Higher tail latency — p95/p99 latency spikes under concurrent load
- Quadratic attention pressure — self-attention cost grows as O(n²) in sequence length n, in both time and memory, for naive implementations
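The KV-cache growth in that list is worth quantifying. The sketch below uses shapes approximating a Llama-2-70B-class dense model (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16) — treat those numbers as assumptions, not a spec:

```python
# KV-cache footprint per request:
#   bytes = 2 (K and V) x layers x kv_heads x head_dim x dtype_bytes x seq_len
# Defaults approximate a 70B-class dense model with GQA; they are assumptions.

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

per_token = kv_cache_bytes(1)                 # 327,680 bytes (~0.33 MB) per token
gb = kv_cache_bytes(128_000) / 1e9            # one long agent session
print(f"{per_token} B/token -> {gb:.1f} GB of KV cache for one 128K request")
```

At roughly 42 GB per 128K-token session, a single long-context agent request can consume half of an 80 GB H100 before a single weight is loaded. That is the linear-growth problem in concrete terms.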
The Roofline reality
The Roofline model makes this constraint geometrically precise. For any compute kernel, achievable performance is bounded by min(π, β · I), where π is the platform's peak FLOP/s, β is memory bandwidth, and I is the kernel's operational intensity (FLOPs per byte transferred). During decode, the attention kernel's operational intensity collapses — each token attends to the entire KV cache but performs only O(d) FLOPs per cached entry, where d is the head dimension (typically 64–128). On an H100 SXM, this places decode squarely in the memory-bound regime of the Roofline, far below the 989 TFLOPS FP16 peak. No amount of tensor-core optimization recovers throughput when the bottleneck is the 3.35 TB/s HBM3 read bandwidth. Even NVIDIA's B200 with HBM3e raises bandwidth to ~8 TB/s — a meaningful improvement, but one that shifts the ceiling rather than removing it. The operational intensity of autoregressive decode remains intrinsically low.
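The Roofline bound is simple enough to compute directly. Using the H100 SXM figures from the text:

```python
# Roofline model: attainable FLOP/s = min(pi, beta * I), where pi is peak
# compute, beta is memory bandwidth, and I is operational intensity
# (FLOPs per byte moved). H100 SXM figures as quoted in the text.

def roofline_flops(intensity, peak_flops=989e12, bandwidth=3.35e12):
    return min(peak_flops, bandwidth * intensity)

ridge_point = 989e12 / 3.35e12            # intensity needed to reach peak
decode_ceiling = roofline_flops(1.0)      # decode attention: I is O(1) FLOPs/byte
print(f"ridge point: ~{ridge_point:.0f} FLOPs/byte")
print(f"decode ceiling at I=1: {decode_ceiling/1e12:.2f} TFLOP/s of 989 peak")
```

A kernel needs roughly 295 FLOPs per byte moved to saturate the tensor cores. Decode attention delivers a few at best, which is why it sits pinned to the bandwidth-sloped region of the roof.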
Speculative decoding — where a smaller draft model proposes candidate tokens verified in parallel by the target model — is the most promising throughput mitigation. It converts serial decode steps into batch-verifiable prefill steps, improving throughput by a factor proportional to the draft model's acceptance rate α. But it introduces its own constraints: the draft model must approximate the target distribution closely enough to maintain a high α, and the verification step itself consumes KV cache and bandwidth proportional to the draft length γ. At 128K+ context lengths with complex repository state, acceptance rates degrade as the conditional distributions diverge, and amortized gains narrow. Speculative decoding shifts the constant. It does not change the asymptotic.
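The degradation is visible in the standard expected-yield formula from the speculative decoding literature (Leviathan et al.): with draft length γ and per-token acceptance rate α, the expected number of tokens emitted per target-model verification pass is (1 − α^(γ+1)) / (1 − α). A quick sketch:

```python
# Expected tokens emitted per verification pass in speculative decoding
# (Leviathan et al. formulation): E[tokens] = (1 - alpha**(gamma+1)) / (1 - alpha)
# for per-token acceptance rate alpha and draft length gamma.

def expected_tokens(alpha: float, gamma: int) -> float:
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

print(f"{expected_tokens(0.8, 4):.2f}")  # healthy acceptance: ~3.36 tokens/pass
print(f"{expected_tokens(0.4, 4):.2f}")  # degraded long-context acceptance: ~1.65
```

Halving α does not halve the speedup — it collapses most of it, because the geometric series is front-loaded. That is the mechanism behind "amortized gains narrow."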
For Mixture-of-Experts architectures — Mixtral, DBRX, DeepSeek-V3, and their successors — only a subset of parameters is activated per token via a learned gating function, reducing FLOPs per forward pass. But the full parameter tensor still resides in HBM, and the routing layer must read gating weights, compute top-k expert selection, and scatter-gather activations across expert shards every token. The memory footprint does not shrink proportionally to the active parameter count. Worse, the irregular memory access patterns of expert routing — where different tokens in a batch activate different experts — create TLB pressure, cache-line waste, and load-imbalance across SMs that regular dense models avoid. MoE buys FLOP efficiency at the cost of memory-access irregularity — a tradeoff that compounds under the long-context, high-concurrency regime that agent workloads demand.
The TLB and page-fault tax
This is the layer almost nobody discusses. I covered it extensively in my MMU article, but here is the executive summary for this argument.
When an LLM's working set exceeds what the GPU's Translation Lookaside Buffer (TLB) can cache, every memory access requires a page-table walk — multiple dependent memory reads to translate virtual addresses to physical ones. On an H100 with a 70B model:
- At 4 KB page granularity: ~36 million pages for weights alone
- The L1 TLB per SM holds a tiny fraction of those translations
- During decode, the GPU is almost continuously walking page tables
If the working set is oversubscribed and pages need to migrate, a GPU page fault can stall an entire warp of 32 threads. If enough SMs fault simultaneously, you have effectively stalled the entire GPU.
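The "~36 million pages" figure checks out with straightforward arithmetic (treating the 140 GB as binary gigabytes), and it also shows why large pages matter so much for this workload:

```python
# Page-table pressure for a 70B FP16 model's weights alone.
# Using binary units: 140 GiB = 140 * 2**30 bytes.

weight_bytes = 140 * 2**30                 # 70B params x 2 bytes (FP16)
pages_4k = weight_bytes // 4096            # 4 KiB page granularity
pages_2m = weight_bytes // (2 * 2**20)     # 2 MiB large pages

print(f"{pages_4k:,} pages at 4 KiB")      # 36,700,160 translations to cache
print(f"{pages_2m:,} pages at 2 MiB")      # 71,680 — a 512x reduction
```

A 512× reduction in the number of translations is the difference between a TLB that thrashes continuously and one with a fighting chance — which is why large-page allocation is a first-order serving decision, not a tuning afterthought.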
What this means for the "replace engineers" narrative
The hardware reality imposes hard ceilings on what AI agents can do in real time:
Latency ceiling. An agent that takes 45 seconds to generate a patch for a well-scoped bug cannot participate in a live architecture discussion, negotiate requirements in real time, or respond to an incident with the urgency humans bring. The token-by-token generation paradigm is fundamentally serial at decode time.
Context ceiling. Even with 128K or 1M token context windows, the effective context is constrained by attention degradation, KV cache memory pressure, and the quadratic cost of attending to everything. Real codebases are millions of lines. No current model can hold a meaningful representation of an entire system in working memory.
Concurrency ceiling. Serving an AI agent at scale means managing KV cache as a first-class resource. As I wrote in my Microsoft article:
The moment you treat inference as a multi-tenant memory system, not a model endpoint, you stop chasing incidents and start designing control. — Hazem Ali
Each concurrent agent session consumes GPU memory for its KV cache. Under real traffic, serving becomes memory admission control — can you accept this request without blowing the KV budget and collapsing batch size?
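What "memory admission control" means in practice can be sketched in a few lines. The controller and its sizes below are illustrative assumptions, not any particular serving stack's API; the per-token figure assumes a 70B-class GQA model (~0.33 MB of KV per token):

```python
# Sketch of inference serving as memory admission control: admit a new agent
# session only if its worst-case KV footprint fits the remaining budget.
# Class name, sizes, and policy are illustrative assumptions.

class KVAdmissionController:
    def __init__(self, budget_bytes: int):
        self.budget = budget_bytes
        self.reserved = 0

    def try_admit(self, max_seq_len: int, bytes_per_token: int = 327_680) -> bool:
        need = max_seq_len * bytes_per_token   # worst-case KV for this request
        if self.reserved + need > self.budget:
            return False                       # reject: protect batch size
        self.reserved += need
        return True

ctl = KVAdmissionController(budget_bytes=60 * 10**9)   # 60 GB KV budget
print(ctl.try_admit(128_000))   # True  — ~42 GB reserved
print(ctl.try_admit(128_000))   # False — a second 128K session would not fit
```

Real systems refine this with paged KV allocation and preemption, but the core decision is the same: accept, or protect the sessions already running.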
Determinism ceiling. GPU floating-point arithmetic is not strictly order-independent. Under concurrent serving, batch composition, kernel selection, and scheduling decisions can change. The same prompt can produce different outputs across runs — not because the model is random, but because the runtime execution path changed.
I demonstrated this live at CognitionX Dubai 2025. Same model, same weights, same GPU. The only thing that changed was context pressure. The audience watched latency degrade and throughput collapse in real time. Not because the model got weaker. Because the serving system ran into memory physics.
The energy ceiling nobody mentions
There is one more physical constraint. A single H100 GPU consumes approximately 700W under full load. A cluster serving thousands of concurrent agent sessions consumes megawatts. The energy cost of running an AI agent continuously — at the speed and context depth needed for real engineering work — is orders of magnitude higher than running a human engineer.
This is not an efficiency problem that scales away. It is a thermodynamic constraint rooted in Landauer's principle: every irreversible bit operation dissipates at least k_B · T · ln 2 joules, where k_B is Boltzmann's constant and T is the operating temperature. Modern GPUs operate orders of magnitude above this theoretical floor, but the trajectory matters — energy per operation has been improving far more slowly than model parameter counts have been growing. The curves do not converge favorably. Every token generated is an energy transaction. Every KV cache read is a memory bus transaction that dissipates heat. When you factor in datacenter Power Usage Effectiveness (PUE) — typically 1.1–1.4 for hyperscale facilities, accounting for cooling, networking, power conversion, and storage overhead — the delivered energy per useful FLOP carries substantial multiplicative overhead. At scale, the question is not just "can AI do the work?" but "can we afford the energy budget for AI to do the work at the speed and quality required?"
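The Landauer floor itself is a one-line computation, and putting a number on it shows just how much headroom — and how much waste — sits between physics and current silicon:

```python
import math

# Landauer bound: minimum energy to erase one bit is k_B * T * ln 2.
# At ~300 K this is ~2.9e-21 J — many orders of magnitude below the
# energy-per-operation of current accelerators.

k_B = 1.380649e-23            # Boltzmann constant, J/K (exact, SI 2019)
T = 300.0                     # assumed operating temperature, K
landauer_joules = k_B * T * math.log(2)
print(f"{landauer_joules:.2e} J per irreversible bit operation")  # ~2.87e-21 J
```

The gap between that floor and real hardware is where all engineering tradeoffs live: it guarantees that efficiency gains are possible in principle, and it guarantees they are not free.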
Part IV: The Governance Void
Why accountability cannot be tokenized
In my Zero-Trust Agent Architecture article, I framed the mindset clearly:
Once an agent can call tools that mutate state, treat it like a privileged service, not a chatbot. — Hazem Ali
The same principle applies to AI-as-engineer claims. The moment you position an AI system as "doing engineering," you are claiming it can:
- Make tradeoff decisions that affect system reliability
- Accept or reject security risks on behalf of an organization
- Own the consequences when those decisions fail
No current AI system can do any of these things. Not because the models are not smart enough. Because accountability is a human contract, not a computational one.
The NIST AI RMF makes this precise. Trustworthy AI requires accountability structures, role clarity, and defined responsibilities. These are organizational properties, not model properties. You cannot fine-tune accountability into weights.
The regulatory landscape reinforces this structurally. The EU AI Act — which entered into force in August 2024 with phased compliance deadlines through 2027 — classifies AI systems by risk tier and imposes binding obligations on high-risk systems: conformity assessments, technical documentation, human oversight mechanisms, and post-market monitoring. An AI coding agent deployed to modify safety-critical software (medical devices, financial infrastructure, autonomous systems) would fall under the high-risk classification, triggering obligations that presuppose a responsible natural or legal person — not a model. Article 14 explicitly requires that high-risk AI systems be designed to allow effective human oversight, including the ability to "fully understand the capacities and limitations of the high-risk AI system." No current LLM satisfies this interpretability requirement for the code it generates — the internal representations that produce a code patch are not inspectable in any legally meaningful sense. The regulatory framework does not merely suggest human accountability. It codifies it into law.
The security gap is not theoretical
The large-scale security study on SWE-bench patches deserves a dedicated section because the numbers are stark:
- Standalone LLM patches introduced nearly 9× more new vulnerabilities than developer patches
- Greater autonomy amplified vulnerability risk
- Underspecified issues produced the highest vulnerability rates
This is not a prompt engineering problem. This is a fundamental capability gap. A human engineer reads an issue, considers attack surfaces, thinks about what an adversary could do with the input, and writes defensive code by default. An LLM generates the most probable next token given its training distribution. Those are structurally different processes, and they produce structurally different security outcomes.
What organizations should actually do
The defensible position is not "AI can't write code." It clearly can. The defensible position is this: treat AI coding agents as extraordinarily capable workers operating inside a human-owned engineering discipline — every generated change reviewed, tested, and accepted by an engineer who remains accountable for it.
Part V: The Illusion of Understanding — Thinking, Reasoning, and the Limits of Cognition
Statistical generation is not formal reasoning
The AI industry has adopted the word "reasoning" to describe what large language models do. This framing is useful shorthand but obscures a critical engineering distinction.
Reasoning, in the philosophical and cognitive science traditions, involves the construction of valid inferences from premises to conclusions through rules of logic. Deductive reasoning preserves truth: if the premises are true and the inference rules are valid, the conclusion must be true. Inductive reasoning generalizes from observations with acknowledged uncertainty. Abductive reasoning infers the best explanation from incomplete evidence.
What an LLM does is fundamentally different in mechanism, even when it produces similar-looking output. It computes a conditional probability distribution over the next token given the preceding context: P(x_t | x_<t). The parameters of that distribution were learned by minimizing cross-entropy loss over a massive text corpus. The resulting behavior can mimic the surface form of reasoning — and models can learn and execute algorithmic patterns that behave like rule application — but the underlying mechanism is statistical prediction, not formal inference. There is no proof engine, no truth-preservation guarantee, no soundness contract.
This distinction is not pedantic. It has direct engineering consequences. A formal system that applies modus ponens will never conclude Q from P → Q without P. An LLM can and will, if the token sequence appears frequently enough in its training distribution — because it is not applying modus ponens as a logical rule. It is predicting likely continuations. The logical validity of those continuations is incidental, not guaranteed. The model may often get it right — impressively often — but "often correct" and "guaranteed correct" are categorically different engineering properties.
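The contrast is easy to make concrete. Here is a toy sound inference engine: modus ponens implemented as a rule that fires only when its premises are literally present. Unlike token prediction, it structurally cannot conclude Q from P → Q alone:

```python
# A toy sound inference engine. Soundness here is structural, not statistical:
# the rule fires only when the antecedent is actually among the facts.

def modus_ponens(facts: set[str], implications: set[tuple[str, str]]) -> set[str]:
    """Close `facts` under modus ponens over (antecedent, consequent) pairs."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for p, q in implications:
            if p in derived and q not in derived:
                derived.add(q)
                changed = True
    return derived

print(modus_ponens({"P"}, {("P", "Q")}))   # {'P', 'Q'} — P and P -> Q yield Q
print(modus_ponens(set(), {("P", "Q")}))   # set() — P -> Q alone yields nothing
```

No frequency in any corpus can make the second call emit Q. That guarantee is exactly what the sampling loop of an LLM does not and cannot provide.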
The Chinese Room, updated
John Searle's Chinese Room argument (1980) remains the most precise articulation of the gap between simulation and understanding. The thought experiment describes a person in a room who receives Chinese characters, follows a lookup table to manipulate them, and produces output that appears to be fluent Chinese — without understanding a single character. The person simulates linguistic competence without possessing it.
An LLM shares the Chinese Room's core property: it processes token sequences through layers of matrix multiplications and nonlinearities, producing output that simulates competent code authorship, without the kind of causally grounded understanding that a human engineer possesses. Modern models do form internal latent representations — they develop structures that encode syntactic relationships, semantic similarity, and even some functional properties of code. These representations are real and sometimes surprisingly rich. But they are lossy, inconsistent under distribution shift, and ungrounded in physical reality. The model has a statistical proxy for what a mutex does in code — it has never experienced what happens when concurrent threads corrupt shared state on real hardware.
The Stochastic Parrots framing (Bender et al., 2021) makes a related point from computational linguistics: language models produce text that is statistically consistent with their training data, but this consistency should not be confused with the kind of causal understanding that underwrites engineering accountability. The symbol grounding problem — how symbols in a formal system acquire meaning by being connected to the physical world — remains a deep open question for transformer architectures. The model has learned from descriptions of memory, crashes, and pointer dereferences arranged as token sequences. It can generalize from those descriptions with remarkable facility. But there is a structural difference between a model that has learned patterns about buffer overflows from text and an engineer who has watched a stack smash redirect an instruction pointer in a debugger. That difference is grounding — and grounding is what connects understanding to accountability.
When "thinking" becomes marketing
The AI industry now markets certain model capabilities as "thinking" — extended chain-of-thought generation where the model produces intermediate steps before arriving at an answer. Let me be precise about what this is and what it is not.
Extended chain-of-thought is a genuine and effective inference-time compute strategy. Giving the model more steps — and sometimes sampling multiple reasoning traces — measurably improves performance because the model can explore more intermediate structure before committing to an answer. This is a real capability improvement, not a gimmick.
But the core generator is still next-token prediction. Even in "reasoning models," the underlying engine produces tokens autoregressively. The chain of thought is generated by the same mechanism that generates everything else. The model can learn to execute algorithmic patterns that function like deliberation — and this learned behavior is genuinely useful — but it operates without the formal guarantees of a proof system or the persistent, externally grounded world state that human cognition maintains.
The critical engineering limitation is not that LLMs never catch errors — they sometimes do, mid-generation, and many reasoning stacks explicitly add verification passes, self-critique, tool-based validation, or separate verifier models. The limitation is that there is no guaranteed, correctness-preserving validator in the generation pipeline the way there is in a formal system or proof checker. A proof assistant like Lean or Coq mechanically verifies each step against well-founded inference rules. An LLM's self-correction is probabilistic — it works often enough to improve aggregate benchmark scores, but it provides no soundness guarantee for any individual output.
The practical consequence for engineering is severe. An LLM can produce a chain of thought that looks like careful architectural reasoning but contains a subtle category error — confusing eventual consistency with strong consistency, or conflating authentication with authorization. It might catch such errors sometimes. It might not. And there is no way to know which case you are in without independent verification. The output is fluent, confident, and sometimes wrong. And because it looks like reasoning, it is more dangerous than an obviously random error: it carries false epistemic authority.
When the industry uses "reasoning" to mean "extended autoregressive generation with intermediate steps," it is borrowing the epistemic authority of the word without the formal guarantees the word implies. Engineers should name things precisely, especially when precision is commercially inconvenient. — Hazem Ali
Accountability requires understanding, and understanding requires grounding
This brings us full circle to the governance argument, but at a deeper epistemological level. Accountability is not merely a legal or organizational structure — it is predicated on the accountable agent's capacity to understand the consequences of their decisions.
When a structural engineer certifies a building for occupancy, they understand — in a deep, causally grounded sense — what happens when steel yields under tensile stress. They understand differential foundation settlement. They understand wind vortex shedding and resonance with a structure's natural frequency. That understanding is not pattern matching over textbook pages. It is a causal model of physical reality, constructed through years of education, laboratory work, field observation, and direct accountability for outcomes.
An AI system that generates a security-critical code path has learned statistical patterns about buffer overflows from vast amounts of code and documentation. It may reproduce defensive patterns with high probability. It may even generalize to novel variations that were not explicitly in its training data — the internal representations models develop are richer than simple lookup tables. But its knowledge is derived entirely from textual descriptions and code examples. It has no causal model of the von Neumann execution cycle, no grounded concept of an adversary with a debugger, and no experience of what happens at the hardware level when a write past the end of a stack-allocated buffer overwrites the saved return address on the stack frame. The model's representation of "buffer overflow" is a statistical structure in latent space. An engineer's understanding of "buffer overflow" is a causal model grounded in direct observation of stack frames, instruction pointers, and exploit behavior.
This gap does not necessarily disappear with scale alone. Models may develop increasingly sophisticated internal representations as they grow — recent interpretability research (Li et al., 2023; Nanda et al., 2023) shows that transformers can learn non-trivial algorithmic structure. But the distance between "learned a useful statistical proxy" and "constructed a causally grounded model sufficient for engineering accountability" remains significant. Scaling may narrow the proxy's error rate. It does not, by current evidence, transform the proxy into grounded understanding.
Accountability requires not just producing correct output, but being able to justify why the output is correct under adversarial scrutiny — in a design review, in an incident postmortem, in a regulatory hearing. A model that produces a correct answer 95% of the time is a powerful tool. A model that cannot explain which 5% are wrong is not an engineer. And building organizational trust on probabilistic performance without grounded justification is how institutions accumulate unmanaged risk.
Part VI: The Deeper Ceilings — What Forty Years of Engineering Science Already Proved
The arguments in Parts I–V are grounded in current empirical evidence: benchmarks, hardware specifications, security audits. But the case against "AI as engineer" runs deeper than present-day observations. Forty years of peer-reviewed research in computability theory, human factors engineering, systems safety, formal verification, and software engineering theory have already established — with mathematical proof and documented catastrophe — the exact ceilings we are now rediscovering in the AI coding discourse.
Every claim in this section is sourced to a specific peer-reviewed publication, formally proven theorem, or independently documented engineering disaster. Nothing here is opinion.
Bainbridge's Ironies of Automation — The Paradox of Deskilling
Peer review: Lisanne Bainbridge, "Ironies of Automation," Automatica, Vol. 19, No. 6, pp. 775–779, 1983. Published by Pergamon Press (now Elsevier). Peer-reviewed by the International Federation of Automatic Control (IFAC). Cited over 4,300 times. Validated empirically by the FAA's 2013 report on operational flight safety and by NASA's Aviation Safety Reporting System data.
Bainbridge proved — through rigorous analysis of human operator performance in automated process control — that automation creates a compounding paradox with two ironies:
- The skill-decay irony. The more a task is automated, the less the human operator practices it. When automation fails, the human — now responsible for manual intervention — is measurably less capable than they were before automation was introduced.
- The hardest-failure irony. Automation is applied to tasks precisely because they are difficult. Therefore, the failures that automation cannot handle are, by definition, the hardest cases — and the human is now expected to handle the hardest cases with degraded skills.
This is not theoretical. It has been empirically validated across decades in aviation (Air France Flight 447, 2009 — pilots could not manually recover from a stall after autopilot disconnected because they had insufficient manual flying experience), nuclear power plant operations (Three Mile Island partial meltdown, 1979 — operators could not interpret plant state when automated systems produced contradictory readings), and medical device monitoring.
The mapping to AI coding agents is structurally identical: the more routine coding is delegated to agents, the less engineers practice it; and the failures agents cannot handle are, by construction, the hardest ones — which now land on supervisors whose hands-on skills have atrophied.
Rice's Theorem — Semantic Code Properties Are Undecidable
Peer review: Henry Gordon Rice, "Classes of Recursively Enumerable Sets and Their Decision Problems," Transactions of the American Mathematical Society, Vol. 74, No. 2, pp. 358–366, 1953. Peer-reviewed by the American Mathematical Society. This is a mathematical theorem — it is proven, not hypothesized. It has the same epistemic status as the Pythagorean theorem.
Rice's theorem proves that for any non-trivial semantic property of programs, no algorithm can decide whether an arbitrary program has that property. A "non-trivial" property is one that some programs have and some do not. "Terminates on all inputs." "Never accesses memory out of bounds." "Never leaks credentials." "Preserves the invariant that account balances are non-negative." These are all non-trivial semantic properties. Rice proved — with mathematical certainty — that no general algorithm can decide any of them.
The engineering implication is precise and non-negotiable. When someone claims an AI agent can determine whether its generated code is "correct" or "secure," they are claiming it can decide a non-trivial semantic property of programs. Rice proved this is impossible in 1953. The proof is mathematical — it does not expire, it does not depend on model scale, and it is not overcome by more training data. Engineers work around undecidability through domain restriction, invariant-based design, and formal verification of specific programs. These workarounds require knowing which mathematical subset of the problem you are operating in. An LLM has no such knowledge. It generates tokens.
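The shape of the underlying argument fits in a few lines. Suppose, hypothetically, that a total procedure `decide_never_leaks(src)` existed (all names here are illustrative, not a real API). The classic reduction wraps an arbitrary program in a gadget that leaks exactly when that program halts — so the leak-decider would solve the halting problem, which Turing proved impossible:

```python
def build_gadget(program_src: str, inp: str) -> str:
    """Build a program that leaks credentials iff program_src halts on inp."""
    return (
        f"run({program_src!r}, {inp!r})\n"  # arbitrary computation first
        "leak_credentials()\n"              # reached only if run() returns
    )

def halts(program_src: str, inp: str, decide_never_leaks) -> bool:
    # If a total decide_never_leaks existed, this function would decide
    # halting -- which is undecidable. Hence no such decider can exist.
    return not decide_never_leaks(build_gadget(program_src, inp))
```

No amount of model scale escapes this reduction: it constrains any algorithm, statistical or otherwise.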
The Therac-25 — When Component-Level Correctness Kills
Peer review: Nancy G. Leveson and Clark S. Turner, "An Investigation of the Therac-25 Accidents," IEEE Computer, Vol. 26, No. 7, pp. 18–41, July 1993. Peer-reviewed by the IEEE Computer Society. Extended in Nancy G. Leveson, Engineering a Safer World, MIT Press, 2011. The Therac-25 case is the most extensively studied software-related disaster in computing history, documented in over 200 academic publications.
Between 1985 and 1987, the Therac-25 — a computer-controlled radiation therapy machine — delivered lethal radiation overdoses to at least six patients, killing at least three. The root cause was not a "code bug" in the narrow sense that any benchmark would catch. It was a race condition between operator input speed and software mode-setting that was invisible to component-level testing.
The critical detail: the Therac-25 reused software from the Therac-20. In the Therac-20, hardware interlocks physically prevented the electron beam from activating in the wrong mode — even if the software entered an incorrect state. When the Therac-25 removed the hardware interlocks and relied entirely on software control, the race condition became lethal. The software passed all tests in both systems. The deaths were caused by an emergent system interaction, not a component defect.
Thompson's Trusting Trust — The Trust Chain AI Cannot Enter
Peer review: Ken Thompson, "Reflections on Trusting Trust," Communications of the ACM, Vol. 27, No. 8, pp. 761–763, August 1984. Turing Award Lecture, Association for Computing Machinery. Thompson received the Turing Award (1983) jointly with Dennis Ritchie for the development of Unix.
Thompson demonstrated that a compiler can be modified to: (a) insert a backdoor into any login program it compiles, and (b) insert the backdoor-insertion code into any compiler it compiles — then the original modifications are removed from the source code. The attack is undetectable by source code inspection, code review, or static analysis. The trojan propagates through the compiler binary, not the source.
Thompson's insight is forty-one years old and remains unfalsified. You cannot establish trust in software through inspection alone. Trust is an institutional property, not a computational one. AI coding agents can generate correct code, but they cannot participate in the trust institutions — legal liability, professional certification, reproducible build verification, regulatory compliance — that make correctness trustworthy. This is not a capability gap that closes with model scale. It is a categorical boundary between computation and accountability.
Brooks's Essential Complexity — The Irreducible Core
Peer review: Frederick P. Brooks Jr., "No Silver Bullet — Essence and Accident in Software Engineering," Proceedings of the IFIP Tenth World Computing Conference, pp. 1069–1076, 1986. Republished in IEEE Computer, Vol. 20, No. 4, pp. 10–19, April 1987. Peer-reviewed by the IEEE Computer Society. Brooks received the ACM Turing Award in 1999.
Brooks made a distinction in 1986 that predicts — with remarkable precision — the exact contours of AI coding agent productivity forty years later. Software complexity has two fundamentally different components:
- Accidental complexity: artifacts of our tools, languages, and representations. Boilerplate code, syntactic noise, build configuration, manual memory management, repetitive CRUD patterns, test scaffolding.
- Essential complexity: inherent in the problem domain itself. What should the system do when two business rules contradict? What consistency model is appropriate given the CAP constraints? Which failure modes does the business accept? How do regulatory requirements interact with performance requirements?
Brooks's thesis: no tool — no matter how powerful — can reduce essential complexity, because essential complexity comes from the problem, not from the representation. Only accidental complexity can be compressed by better tools.
Brooks's prediction is now testable against real data. Organizations adopting AI agents for accidental-complexity tasks report significant productivity gains — and they are real. Organizations that extend AI agents to essential-complexity tasks — architecture decisions, requirements negotiation, governance — will discover what Brooks already proved: the dominant cost of software engineering is irreducible by any tool. The acceleration applies only to the smaller portion of the work.
CompCert and seL4 — What "Correct" Actually Costs
Peer review:
- Xavier Leroy, "Formal Verification of a Realistic Compiler," Communications of the ACM, Vol. 52, No. 7, pp. 107–115, July 2009. Peer-reviewed by ACM. Originally presented at POPL 2006 (ACM SIGPLAN).
- Gerwin Klein et al., "seL4: Formal Verification of an OS Kernel," Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP), pp. 207–220, 2009. Peer-reviewed by ACM SIGOPS. Winner of the ACM SIGOPS Hall of Fame Award (2019) and the ACM Software System Award (2022).
- Xuejun Yang et al., "Finding and Understanding Bugs in C Compilers," Proceedings of PLDI 2011, ACM, pp. 283–294. Peer-reviewed by ACM SIGPLAN. (The CSmith validation study.)
Two formally verified systems quantify the chasm between "generates code that passes tests" and "produces artifacts whose correctness is mathematically established."
CompCert is a formally verified optimizing C compiler. Its correctness proof — mechanically checked by the Coq proof assistant — guarantees that the compiled code preserves the semantics of the source program. Not "probably." Not "in our tests." Mathematically. The proof required approximately 100,000 lines of machine-checked Coq for ~20,000 lines of compiler code — a ~5:1 proof-to-code ratio — and roughly 6 person-years.
seL4 is a formally verified microkernel. Its functional correctness proof establishes that the implementation exactly matches its specification. The proof required approximately 200,000 lines of Isabelle/HOL for ~10,000 lines of C — a 20:1 proof-to-code ratio — and roughly 20 person-years.
The empirical validation is striking. The CSmith random testing project (Yang et al., PLDI 2011) generated hundreds of thousands of random C programs and compiled them with GCC, LLVM, and CompCert. CSmith found hundreds of bugs in both GCC and LLVM — production compilers maintained by large teams for decades. It found zero bugs in CompCert's verified passes. This is the difference between "passes tests" and "is correct." CompCert's correctness was established by proof. The tests merely confirmed what the proof already guaranteed.
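CSmith's method — differential testing against a trusted reference — is easy to sketch in miniature. The fragment below is a toy, not CSmith: it evaluates random arithmetic expressions with a reference evaluator and a deliberately buggy one that simulates a miscompiled subtraction, then flags every disagreement:

```python
import random

def reference_eval(expr: str) -> int:
    return eval(expr)                      # trusted oracle

def buggy_eval(expr: str) -> int:
    return eval(expr.replace("-", "+"))    # simulated miscompilation of '-'

def differential_test(trials: int = 200, seed: int = 0) -> list[str]:
    """Generate random expressions; report those where the two disagree."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        a, b = rng.randint(0, 9), rng.randint(0, 9)
        expr = f"{a} {rng.choice('+-*')} {b}"
        if reference_eval(expr) != buggy_eval(expr):
            failures.append(expr)
    return failures
```

Every reported failure is a real semantic divergence — the miniature analogue of a silent miscompilation that no "it compiles and runs" check would ever surface.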
The Frame Problem — What Changes Don't Change
Peer review: John McCarthy and Patrick J. Hayes, "Some Philosophical Problems from the Standpoint of Artificial Intelligence," Machine Intelligence 4, Edinburgh University Press, pp. 463–502, 1969. Foundational work in AI philosophy. Extensively analyzed in Murray Shanahan, "The Frame Problem," Stanford Encyclopedia of Philosophy, 2016 (peer-reviewed philosophical reference). Extended formally by Reiter (1991), Thielscher (1997), and others in the knowledge representation community.
The frame problem — identified at the birth of AI research — is the difficulty of representing what a change does not affect. When you move a block in a blocks-world, every other block's position, color, weight, and material remain unchanged. These "non-effects" vastly outnumber the effects, and representing them explicitly grows combinatorially.
In software engineering, this manifests as the invariant preservation problem: when you modify one component, which system invariants must remain true, and how do you verify that they still hold?
The frame problem explains — with the precision of a 1969 foundational AI result — why AI agents produce patches that are "locally correct, globally wrong." The agent addresses the stated change. It cannot enumerate the unstated non-changes. In a system with many invariants, a change that directly touches only a few of them still requires verifying that all of them are preserved. The agent verifies zero of them unless the test oracle coincidentally covers them. The engineer verifies all of them — or at least all the ones their mental model of the system encompasses — because invariant preservation is what engineering is.
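A toy ledger makes the failure mode concrete (all names invented for illustration). The system invariant — total balance is conserved across accounts — is stated nowhere in the function being patched, so a locally plausible change breaks it silently:

```python
accounts = {"a": 100, "b": 50}

def transfer(src: str, dst: str, amount: int) -> None:
    if accounts[src] < amount:
        raise ValueError("insufficient funds")
    accounts[src] -= amount
    accounts[dst] += amount

def transfer_with_fee(src: str, dst: str, amount: int, fee: int = 1) -> None:
    # "Locally correct" patch: deducts amount + fee, credits amount.
    # Globally wrong: the fee vanishes, so the conservation invariant
    # (sum of balances is constant) is silently broken -- and no unit
    # test of this function in isolation will notice.
    if accounts[src] < amount + fee:
        raise ValueError("insufficient funds")
    accounts[src] -= amount + fee
    accounts[dst] += amount

def conserved(expected_total: int) -> bool:
    return sum(accounts.values()) == expected_total
```

This is the oracle problem in miniature: unless the test suite explicitly encodes the system-level invariant, both versions pass.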
Rice proved you cannot algorithmically decide semantic properties. McCarthy showed you cannot tractably enumerate non-effects. Leveson proved that safety is emergent, not component-level. Thompson proved that trust requires institutions, not inspection. Brooks proved that essential complexity is irreducible. Bainbridge proved that automation degrades the skills needed to supervise it. These are not opinions. They are theorems, proofs, and documented catastrophes — peer-reviewed and validated across decades. The "AI replaces engineers" claim must contend with all of them simultaneously. It has contended with none. — Hazem Ali
Part VII: The Extraordinary Tool — And The Discipline It Demands
AI is genuinely powerful. That is precisely why precision matters.
I want to be explicit about something before this article is misread as anti-AI. It is not. I have spent the last several years building AI systems, deploying them in production, speaking about them internationally, and publishing on their architecture at Microsoft. I am not standing outside the field throwing stones. I am standing inside it, building with these tools every day.
AI coding agents represent a genuine paradigm shift in developer productivity. The ability to generate boilerplate, scaffold test harnesses, translate between languages, explain unfamiliar codebases, and produce first-draft implementations at machine speed — these capabilities are real, they are valuable, and they are transforming how software gets made.
For bounded, well-scoped, test-verifiable tasks — the kind of work that used to consume 40% of a senior engineer's week — AI agents are not just useful. They are extraordinary. The engineer who refuses to use them is not demonstrating craftsmanship. They are demonstrating the same inertia that made people resist version control, CI/CD, and automated testing.
AI agents are the most powerful labor amplifier software engineering has ever seen. The mistake is confusing labor amplification with engineering replacement. — Hazem Ali
But the power of the tool makes disciplined thinking more important, not less. When a tool can produce plausible output at high speed, the cost of accepting wrong output uncritically also increases at high speed. The acceleration applies in both directions.
The epidemic of unverified claims
This brings me to something that has been eroding the integrity of the AI discourse: the proliferation of unverified claims presented as evidence of engineering-level capability.
A prominent example that circulated widely. A leading AI company published a detailed engineering blog post describing how their AI agent built a C compiler. The headline framing — "from scratch" — was cited across the industry as evidence that AI agents can now perform complex, systems-level engineering autonomously. The claim was treated as a milestone.
I read the entire post. Carefully. And what the post actually describes is something genuinely impressive, but categorically different from what the headline implies.
What "mostly from scratch" actually means — a close reading
The post itself, to its credit, contains the evidence needed to understand the real achievement. But most readers stopped at the headline. So let me walk through what the post describes versus what it was interpreted to mean.
Misconception 1: "The AI agent engineered the system autonomously."
What the post actually describes is that the human contribution was the harness and evaluation environment, not just the initial prompt. The author explicitly states that the loop only works if the model can tell how to make progress, and that most effort went into the environment around the model: tests, build infrastructure, feedback mechanisms. That is engineering. The human built the verification system, the CI pipeline, the test selection logic, and the failure-mode iteration loop. The model generated code inside that system. The system itself was engineered by humans.
Misconception 2: "No ongoing steering or correction happened."
The post describes repeated adjustments to the harness based on observed model failures. Near the end, the model started breaking old functionality when adding new features — a classic regression pattern. The response? The human built CI and stricter enforcement to prevent regressions. That is ongoing intervention at the process level. Not typing code, but shaping the constraint system that determines what "success" means. That is engineering. The model was the worker inside the constraints. The human was the engineer defining and adjusting them.
Misconception 3: "Parallel agents achieved independent, self-coordinating progress."
The post describes hitting a ceiling precisely when the task became globally coupled — compiling the Linux kernel. Agents got stuck because they hit the same bug and overwrote each other's changes. "Mostly walked away" does not mean "the multi-agent system robustly self-coordinates on hard coupled work." It means the opposite: when coordination was required, the system failed, and humans had to intervene with architectural solutions.
Misconception 4: "No external oracles or scaffolding were used."
The post describes using GCC as a known-good oracle to make the kernel compilation task parallelizable. The human used GCC to compile most files and narrowed down which subset failed under the model's compiler, enabling parallel debugging. That is a classic engineering move: introduce a trusted reference system to isolate faults. The model did not invent this strategy. The engineer did.
Misconception 5: "Clean-room implementation means no prior knowledge leakage."
The post states that the model had no internet access "at any point during its development." That is an execution constraint. It is not a training-data guarantee. The model's weights were trained on vast corpora that include compiler source code, C language specifications, LLVM documentation, GCC internals, and decades of compiler construction literature. "Clean-room" in this context means the model did not look things up during generation. It does not mean the model had no statistical prior over compiler-like code patterns. Those are fundamentally different claims, and conflating them is misleading.
Misconception 6: "This is a production-ready, drop-in compiler."
The post itself explicitly lists limitations: calls out to GCC in certain places, assembler and linker issues, code inefficiency, and not being drop-in ready. These are not minor caveats. They are the gap between a demonstration and a tool you would trust to compile software that other people rely on.
The accurate reading is: engineers mostly walked away from day-to-day coding, but not from engineering the verification harness, designing tests, building CI, creating oracles, and iterating on the environment so the model could self-correct. That is not "AI as engineer." That is "AI as worker inside a human-engineered system." — Hazem Ali
Why a C compiler cannot be built from scratch by an AI agent
Now let me make the technical case for why the "from scratch" framing is not just misleading in this instance, but structurally impossible given what a C compiler actually requires — from the deepest levels of hardware and low-level software.
A C compiler is not a code generation task. It is a formal language translation pipeline that must be correct at every stage, down to individual bits in the output binary. The consequences of miscompilation are not test failures. They are silent correctness violations in every program the compiler ever touches — the most dangerous class of software bug, because the program compiles, runs, and produces wrong results without any visible error.
Here is what "from scratch" actually requires:
1. Lexical analysis conforming to the C standard's translation model.
C11 §5.1.1.2 defines an 8-phase translation model. Phase 1 handles trigraph replacement. Phase 2 handles line splicing (backslash-newline). Phase 3 decomposes source into preprocessing tokens and whitespace. A correct lexer must handle trigraphs, digraphs, universal character names (UCNs), and the subtle interactions between preprocessing and tokenization that have produced bugs in production compilers for decades. This is not pattern matching. It is formal language processing governed by a 700-page normative standard.
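Even the first two phases, taken alone, are mechanical but unforgiving. Here is a deliberately simplified sketch of phase 1 (trigraph replacement) and phase 2 (line splicing) as plain string passes — a real implementation must scan left to right and track source positions:

```python
TRIGRAPHS = {"??=": "#", "??(": "[", "??)": "]", "??<": "{",
             "??>": "}", "??/": "\\", "??'": "^", "??!": "|", "??-": "~"}

def phase1_trigraphs(source: str) -> str:
    """C11 phase 1 (simplified): map each trigraph to its single character."""
    for tri, ch in TRIGRAPHS.items():
        source = source.replace(tri, ch)
    return source

def phase2_splice(source: str) -> str:
    """C11 phase 2: delete each backslash-newline, joining physical lines."""
    return source.replace("\\\n", "")

# A macro split across physical lines becomes one logical line:
phase2_splice("#define MAX 10\\\n0\n")   # -> "#define MAX 100\n"
```

Even this sketch hides a subtlety: the trigraph `??/` becomes a backslash, so a `??/` at end of line must splice lines after replacement — which is why the phase ordering is normative, not stylistic.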
2. Parsing a context-sensitive grammar.
C's grammar is not context-free. The classic example is the lexer hack: T * x; can be either a pointer declaration or a multiplication expression, depending on whether T is a typedef name in the current scope. The parser must feed symbol-table state back to the lexer in real time. This is not a detail an LLM can learn from code examples — it is a formal property of the language that requires the parser and semantic analyzer to cooperate at a level that violates the clean separation of concerns that LLMs are trained to expect.
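The feedback loop is small but category-breaking for a clean lexer/parser split. A toy sketch (invented names, not a real parser API) of how the parser's typedef table changes the meaning of the same token stream:

```python
def classify(tokens: list[str], typedef_names: set[str]) -> str:
    """Disambiguate `T * x ;` using parser state fed back to the lexer."""
    if tokens[0] in typedef_names:
        return "declaration"   # T is a type: pointer declaration of x
    return "expression"        # T is a variable: multiplication T * x

stream = ["T", "*", "x", ";"]
classify(stream, set())        # -> "expression"
classify(stream, {"T"})        # -> "declaration"  (after `typedef int T;`)
```

The same bytes parse two different ways depending on state accumulated arbitrarily far earlier in the translation unit — exactly the kind of long-range, stateful coupling that token-local prediction does not model.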
3. Correct integer promotion and usual arithmetic conversions.
C's integer promotion rules (§6.3.1.1) and usual arithmetic conversion rules (§6.3.1.8) are notoriously subtle. unsigned int compared with int follows different rules than unsigned short compared with int, because the latter undergoes integer promotion to int first. Getting this wrong does not crash the program. It silently changes the semantics of comparisons, arithmetic, and bitwise operations. Even experienced C programmers get these rules wrong. A compiler must get them right for every expression, in every context, on every target platform — because the width of int, long, and long long varies across platforms, and the promotion rules depend on the relative widths.
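The asymmetry is easy to state and easy to get wrong. A Python sketch of the two comparison cases, assuming a 32-bit int as on mainstream x86-64 ABIs (the width is an assumption, which is precisely the point):

```python
INT_BITS = 32   # assumption: int is 32 bits, as on typical ILP32/LP64 targets

def to_unsigned(value: int, bits: int = INT_BITS) -> int:
    """C's signed-to-unsigned conversion: reduce modulo 2**bits."""
    return value % (1 << bits)

def c_less_than(a: int, a_type: str, b: int, b_type: str) -> bool:
    """Emulate C's usual arithmetic conversions for `a < b` (simplified)."""
    # unsigned short promotes to int (int holds all its values), so that
    # comparison stays signed; int vs unsigned int instead converts the
    # int operand to unsigned.
    if "unsigned int" in (a_type, b_type):
        a, b = to_unsigned(a), to_unsigned(b)
    return a < b

c_less_than(-1, "int", 1, "unsigned int")    # False: -1 becomes 4294967295
c_less_than(-1, "int", 1, "unsigned short")  # True: promotion keeps it signed
```

Two comparisons that look interchangeable in source code take opposite branches, with no warning unless the compiler's diagnostics are enabled and heeded.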
4. Undefined behavior — approximately 200 instances in C11.
A C compiler must handle undefined behavior (UB) correctly, which often means exploiting it for optimization. When the standard says signed integer overflow is UB, a production compiler like GCC or Clang will assume it never happens and optimize accordingly. This is not a heuristic. It is a formal contract between the language specification and the optimizer. An LLM generating compiler code may learn common UB-related patterns from training data, but it has no formal mechanism for systematically reasoning about this contract — determining which optimizations are permitted by UB and which are prohibited by defined behavior, across hundreds of distinct UB instances, requires specification-level compliance that statistical prediction does not guarantee.
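One concrete instance shows the contract in action. At the hardware level, signed addition wraps in two's complement, so `x + 1 > x` is false at `INT_MAX`; but because signed overflow is UB, the standard permits an optimizer to fold that comparison to true for all x. A toy model of the divergence (32-bit int assumed):

```python
INT_MAX = 2**31 - 1

def wraparound_add(a: int, b: int, bits: int = 32) -> int:
    """What the silicon computes: two's-complement wraparound addition."""
    m = 1 << bits
    r = (a + b) % m
    return r - m if r >= (1 << (bits - 1)) else r

# Hardware semantics: INT_MAX + 1 wraps to INT_MIN, so `x + 1 > x` is False.
wraparound_add(INT_MAX, 1)               # -> -2147483648
wraparound_add(INT_MAX, 1) > INT_MAX     # -> False
# Standard semantics: signed overflow is UB, so a compiler may assume it
# never occurs and legally fold `x + 1 > x` to True.
```

The optimizer is not wrong: the program that overflows had no defined meaning to preserve. A compiler must apply that logic consistently across roughly 200 distinct UB instances.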
5. Target-specific code generation and the ABI contract.
This is where "from scratch" collapses entirely. Generating correct machine code for x86-64 requires:
- Correct instruction encoding (variable-length, prefix-dependent, with ModR/M and SIB byte encoding)
- Register allocation that respects the calling convention (System V AMD64 ABI: first 6 integer args in RDI, RSI, RDX, RCX, R8, R9; floating-point in XMM0-XMM7; callee-saved registers RBX, RBP, R12-R15; red zone below RSP)
- Stack frame layout with correct alignment (16-byte stack alignment at each call instruction, per the ABI)
- Struct layout and padding that matches the platform ABI to the byte — because every FFI call, every system call, every interaction with any other compiled code depends on the caller and callee agreeing on exactly where every field lives in memory
- Correct ELF object file generation with section headers, symbol tables, relocations (R_X86_64_PC32, R_X86_64_PLT32, R_X86_64_GOTPCRELX, etc.), and GOT/PLT entries for position-independent code
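The struct-layout rule in the list above is mechanical but must be byte-exact. A sketch of the System V-style algorithm — align each field to its own alignment, then round the total size up to the largest field alignment (the sizes and alignments below are the usual x86-64 values, stated as assumptions):

```python
def struct_layout(fields):
    """fields: list of (name, size, align). Returns (offsets, size, align)."""
    offset, max_align, offsets = 0, 1, {}
    for name, size, align in fields:
        offset = (offset + align - 1) // align * align  # pad to field alignment
        offsets[name] = offset
        offset += size
        max_align = max(max_align, align)
    total = (offset + max_align - 1) // max_align * max_align  # tail padding
    return offsets, total, max_align

# struct { char c; int i; double d; } on x86-64 (System V):
struct_layout([("c", 1, 1), ("i", 4, 4), ("d", 8, 8)])
# -> ({'c': 0, 'i': 4, 'd': 8}, 16, 8): three padding bytes follow c
```

Get one offset wrong by a single byte and every FFI call that shares the struct reads garbage — with no compile-time or runtime error to point at the cause.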
Each of these domains has edge cases that have produced bugs in production compilers maintained by hundreds of engineers over decades. An AI agent cannot generate a correct instruction encoder for x86-64 from scratch because the x86-64 encoding scheme — inherited from 8086 through 80386 through AMD64 — is one of the most irregular, historically-layered instruction encodings in computing history. It is not learnable from examples. It requires byte-level specification compliance.
6. Optimization passes and their interactions.
A production C compiler performs dozens of optimization passes: dead code elimination, constant propagation, loop-invariant code motion, strength reduction, instruction scheduling, register allocation (an NP-hard problem typically solved by graph coloring or linear scan heuristics), auto-vectorization, and interprocedural analysis. These passes interact — the order matters, and an optimization that is correct in isolation can produce wrong code when composed with another. GCC has over 15 million lines of code. LLVM has over 30 million. These are not large because their developers were inefficient. They are large because the problem is genuinely that complex.
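Pass interaction is easy to demonstrate on a toy IR (an invented three-address form, not any real compiler's API). Constant propagation enables dead-code elimination to remove more; run them in the other order and the dead definitions survive:

```python
def const_prop(stmts):
    """Fold operations whose inputs are all known constants."""
    env, out = {}, []
    for dest, op, args in stmts:
        args = [env.get(a, a) for a in args]            # substitute knowns
        if op == "add" and all(isinstance(a, int) for a in args):
            op, args = "const", [sum(args)]
        if op == "const":
            env[dest] = args[0]
        out.append((dest, op, args))
    return out

def dce(stmts, live_out):
    """Drop definitions whose results are never used (one backward pass)."""
    live, out = set(live_out), []
    for dest, op, args in reversed(stmts):
        if dest in live:
            out.append((dest, op, args))
            live |= {a for a in args if isinstance(a, str)}
    return list(reversed(out))

prog = [("x", "const", [2]), ("y", "add", ["x", 3]), ("t", "add", ["x", "x"])]
dce(prog, ["y"])              # keeps x: y still references it
dce(const_prop(prog), ["y"])  # -> [("y", "const", [5])] -- x and t now dead
```

This shows only the benign direction, where ordering costs optimization quality. The harder failure mode is a pass that is correct under its stated preconditions but miscompiles when an earlier pass invalidates them — which is why pass pipelines in GCC and LLVM are engineered and regression-tested rather than assumed to compose.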
The fundamental architectural impossibility
The deeper reason an AI agent cannot build a C compiler "from scratch" is not about any single missing capability. It is about the verification problem at the intersection of hardware and software.
A compiler's output is machine code. Machine code executes on a physical CPU that fetches instructions through a pipeline: instruction fetch → decode → issue → execute → writeback → retire. The correctness of the compiled output depends on the compiler's internal model of this pipeline matching the actual hardware behavior to the bit. Consider what happens at the boundary:
At each stage, the correctness contract changes. The C standard defines behavior in terms of an abstract machine. The ABI defines behavior in terms of register conventions and memory layout. The ISA manual defines behavior in terms of instruction semantics. The CPU implements those semantics in silicon, with microarchitectural details (out-of-order execution, speculative execution, memory ordering) that can expose compiler bugs that no test suite catches.
An LLM operates at the token level. It has no representation of the abstract machine, no model of the pipeline, no understanding of what happens when a REX.W prefix changes a MOV instruction's operand width from 32 bits to 64 bits. These are not things that can be learned by statistical proximity in training data. They are formal correctness properties that require bit-level verification against a hardware specification.
This is why GCC and LLVM have dedicated teams for each target architecture. This is why compiler correctness research — CompCert, Alive2, CSmith — exists as entire subfields. And this is why "from scratch" is not a qualifier you can attach casually to a compiler, any more than you can say a bridge was "mostly" load-tested.
Do your own research. Verify every claim.
I want to offer direct advice to every engineer, every engineering manager, and every technical leader reading this.
Stop accepting benchmark scores as evidence of capability. When an AI company publishes a number — "72% on SWE-bench Verified," "built a C compiler from scratch," "resolves 90% of Jira tickets" — ask the questions that an engineer asks:
- What was the evaluation methodology?
- Was it independently reproduced?
- What was excluded from the benchmark?
- What does "from scratch" mean, precisely? What components were reused?
- What is the false positive rate? The silent failure rate?
- Would you stake your production system on this claim without independent verification?
Read the papers, not the press releases. The gap between what a research paper actually demonstrates and what the marketing summary claims is often enormous. A paper that shows an agent can fix isolated bugs in Python repositories becomes a press release claiming "AI can now do the work of a software engineer." Those are not the same statement. One is a narrow empirical finding. The other is an unsupported generalization.
Reproduce before you adopt. Before deploying an AI agent into your engineering workflow based on published results, run it on your codebase, with your test suites, under your operational constraints. If the vendor resists independent evaluation, that tells you something important about the claim.
Cultivate skepticism as a professional discipline. In the current AI discourse, skepticism is often framed as pessimism or technophobia. It is neither. It is engineering. Engineers are professionally obligated to question claims, demand evidence, and distinguish between demonstrated capability and projected potential. That obligation does not diminish because the technology is exciting. It intensifies.
The most dangerous moment in any technology cycle is when the marketing outpaces the engineering. We are in that moment now. — Hazem Ali
The Line I Keep Drawing
I want to close with something I have been saying in different forms across my writing, my talks, and my architecture reviews.
The question is not whether AI can write code. It can. Often impressively. Sometimes brilliantly.
The question is whether "writing code" is what engineering means.
It is not.
Engineering is requirements negotiation in a room full of competing priorities. Engineering is choosing a pattern that will survive three years of team turnover. Engineering is saying "no" to a feature because the security cost is too high. Engineering is owning the 3 AM incident when the system you designed fails in a way you predicted was unlikely but possible.
AI agents do not attend the meeting. They do not feel the weight of an incident. They do not carry the memory of a production failure into the next architecture decision.
They produce output. Rapidly. Sometimes correctly. Often usefully.
But output is not ownership. And engineering, at its core, is ownership.
Benchmarks measure patch production under bounded oracles. Engineering is accountable stewardship of socio-technical systems under uncertainty. Those are not the same thing, and pretending they are is how organizations ship unmanaged risk. — Hazem Ali
Use AI agents. They are extraordinary tools. Build them into your workflows. Let them handle the bounded, well-scoped, test-verifiable work that used to consume your afternoons.
But do not confuse the worker with the engineer.
The worker produces the patch.
The engineer decides whether to ship it.
Frequently Asked Questions
- Aren't benchmarks improving? Won't contamination be fixed?
- What about agents that can browse documentation and use tools?
- Doesn't this argument apply to junior engineers too?
- What about fine-tuning on internal codebases?
- Is this position permanent, or will AI eventually become an engineer?
- What is the difference between engineering and development?
- Can AI really build a compiler from scratch?
- Don't 'reasoning models' like o1 and o3 actually reason?
- Isn't the Chinese Room argument outdated?
- Should I stop using AI coding agents?
This article builds on ideas I have explored across my Microsoft Tech Community publications. If you want the full technical depth behind the hardware constraints discussed here, start with The Hidden Memory Architecture of LLMs. For the production architecture and governance framing, see AI Didn't Break Your Production — Your Architecture Did and Zero-Trust Agent Architecture.
If you are building production AI systems and dealing with the real constraints — not the demo version — feel free to connect with me on LinkedIn.
— Hazem Ali
Microsoft AI MVP, Distinguished AI & ML Engineer / Architect