The Circular Fix

On March 9, 2026, Anthropic launched Claude Code Review — a multi-agent AI verification system designed to catch bugs in AI-assisted code. Five days earlier, on March 4, Anthropic had shipped the first of three overlapping changes that would degrade Claude Code itself for six weeks.

The verification tool needed verification.

This isn't an anecdote. It's the shape of the problem.

The Proposal

The AI coding industry has a verification crisis. I covered the economics in my last article: more code, fewer checkers, permanent debt. The industry's proposed solution is elegant — use AI to review the AI-generated code that humans can't review fast enough.

The desperation behind this proposal is quantifiable. Faros AI's 2026 Engineering Report — drawing on telemetry from 22,000 developers across 4,000+ teams — measured what happens when AI-generated code hits human review infrastructure:

| Metric | Change |
| --- | --- |
| Median time in code review | +441.5% |
| PRs merged without any review | +31.3% |
| Bugs per developer | +54% |
| Code churn (lines deleted / lines added) | +861% |
| Incident-to-PR ratio | 3× higher |

Median review time more than quintupled. The system's response was not to review better but to stop reviewing: 31.3% more pull requests merged with no review at all. Bugs per developer rose 54%. Code churn, the ratio of lines deleted to lines added and a direct measure of rework, rose 861%.
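Percentage increases are easy to misread as multipliers, so it can help to convert them. The snippet below is a small illustrative sketch: the figures are the Faros numbers quoted above, while the helper name `percent_change_to_multiplier` is made up for this example.

```python
def percent_change_to_multiplier(percent_change: float) -> float:
    """Convert a percentage increase into a multiple of the baseline.

    A +100% change means the value doubled, so the multiplier is 2.0.
    """
    return 1.0 + percent_change / 100.0


# Percentage changes as quoted from the Faros report in the text above.
faros_changes = {
    "Median time in code review": 441.5,
    "PRs merged without any review": 31.3,
    "Bugs per developer": 54.0,
    "Code churn": 861.0,
}

for metric, change in faros_changes.items():
    multiplier = percent_change_to_multiplier(change)
    print(f"{metric}: +{change}% is {multiplier:.2f}x the baseline")
```

Read this way, a +441.5% change puts median review time at a little over five times its earlier level, and a +861% change puts code churn at nearly ten times.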

Faros calls this "Acceleration Whiplash." AI flooded a system built around human-paced development with output it was never designed to absorb. The throughput numbers look good: epics per developer up 66%, task throughput up 33.7%. The downstream numbers are catastrophic.

So the proposal makes sense: if humans can't review fast enough, let AI do it. But here's the question no one selling AI code review tools wants to answer: does it actually work?

The Same Blind Spot

Christo Zietsman's March 2026 paper "The Specification as Quality Gate" asks this directly. His finding: deploying AI reviewers to catch AI-generated problems is structurally circular when executable specifications are absent.

"The review checks code against itself, not against intent. Both agents share the same training distribution and exhibit correlated failures."
— Zietsman, ArXiv 2603.25773

The argument is precise: when a language model generates code and another language model reviews it, both are reasoning from the same artifact — the code itself. Without an external reference (a specification, a formal contract, a deterministic test), the reviewer has no independent basis for judgment. It checks plausibility against its training distribution. The generator checked the same thing. Their errors correlate.
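A toy probability model can make the correlation argument concrete. The sketch below is an illustration, not something taken from Zietsman's paper: it assumes the generator and the reviewer each miss a given bug with the same probability `miss_rate`, and it uses a hypothetical parameter `rho` for how strongly those two misses are correlated (0 means independent, 1 means identical blind spots).

```python
def _validate(miss_rate: float, rho: float) -> None:
    if not (0.0 <= miss_rate <= 1.0 and 0.0 <= rho <= 1.0):
        raise ValueError("miss_rate and rho must lie in [0, 1]")


def both_miss(miss_rate: float, rho: float) -> float:
    """Probability that the generator AND the reviewer miss the same bug.

    Treats each miss as a Bernoulli variable with the same marginal
    probability and Pearson correlation rho, so
    P(both miss) = p^2 + rho * p * (1 - p).
    """
    _validate(miss_rate, rho)
    return miss_rate ** 2 + rho * miss_rate * (1.0 - miss_rate)


def reviewer_miss_given_generator_miss(miss_rate: float, rho: float) -> float:
    """Chance the reviewer also misses a bug the generator already missed.

    This is P(both miss) / P(generator misses) = p + rho * (1 - p).
    """
    _validate(miss_rate, rho)
    return miss_rate + rho * (1.0 - miss_rate)


# Illustrative numbers only: a 30% miss rate at three correlation levels.
for rho in (0.0, 0.5, 1.0):
    joint = both_miss(0.3, rho)
    conditional = reviewer_miss_given_generator_miss(0.3, rho)
    print(f"rho={rho:.1f}  both miss={joint:.3f}  reviewer misses too={conditional:.2f}")
```

Under these assumptions, an independent reviewer turns a 30% miss rate into a 9% escape rate, while a perfectly correlated reviewer leaves it at 30%: the second pass adds nothing.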

Zietsman tested this with planted bug experiments — Claude reviewing Claude-generated code, then cross-family panels of four models from three families. The correlated failures weren't theoretical. Same-family review missed bugs that the generating model had missed for the same reason: the pattern looked right to models trained on the same data.

How Well Does It Work?

Three independent measurements converge on the same answer: not well enough.

IBM Research, AAAI 2026: LLM-as-Judge alone catches approximately 45% of code errors. Combined with deterministic tools (static analysis, type checkers, linters), the catch rate rises to 94%. That 45-to-94 gap is the circularity problem quantified. Roughly half of all errors are invisible to AI review yet visible to tools that check against rules rather than plausibility.
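The size of that gap can be worked through explicitly. The sketch below uses the two catch rates quoted above and assumes, as a simplification, that the combined stack catches everything the AI judge catches on its own; the function name is invented for this example.

```python
def share_of_ai_misses_recovered(ai_only: float, combined: float) -> float:
    """Fraction of the errors missed by AI review alone that the combined
    stack goes on to catch.

    Assumes the combined stack is a superset of the AI-only catches, so the
    extra coverage (combined - ai_only) comes entirely out of the AI's misses.
    """
    if not 0.0 <= ai_only <= combined <= 1.0:
        raise ValueError("expected 0 <= ai_only <= combined <= 1")
    if ai_only == 1.0:
        raise ValueError("AI review misses nothing, so there is no share to compute")
    missed_by_ai = 1.0 - ai_only
    recovered_by_tools = combined - ai_only
    return recovered_by_tools / missed_by_ai


AI_ONLY_CATCH_RATE = 0.45   # LLM-as-Judge alone, as quoted above
COMBINED_CATCH_RATE = 0.94  # LLM-as-Judge plus deterministic tools

share = share_of_ai_misses_recovered(AI_ONLY_CATCH_RATE, COMBINED_CATCH_RATE)
print(f"AI review alone misses {1 - AI_ONLY_CATCH_RATE:.0%} of errors")
print(f"Deterministic tools recover {share:.0%} of those misses")
```

On these figures the deterministic tools add 49 percentage points of coverage, which is close to nine in ten of the errors that AI review alone lets through.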

Martian Code Review Bench, February 2026: The first independent benchmark for AI code review — researchers from DeepMind, Anthropic, and Meta evaluating 17 tools across 200,000+ real pull requests. The best tools achieved F1 scores of 50–60%. Baseline, not breakthrough.

"Broken by Default" (ArXiv 2604.05292, April 2026): 3,500 code artifacts from seven LLMs, 500 security-critical prompts, verified by Z3 SMT solver. Results: 55.8% contained at least one formally proven vulnerability. The best model (Gemini 2.5 Flash) scored 48.4% — a D. The worst (GPT-4o) scored 62.4% — an F. Here's the structural finding: the same models identified their own vulnerable outputs 78.7% of the time in review mode, yet generated them at 55.8%. They can partially see their own bugs. They can't stop producing them. And industry SAST tools missed 97.8% of the formally proven vulnerabilities.

So the verification stack is leaky at every layer. AI generates vulnerable code at 55.8%. AI review catches 78.7% of it — but misses 21.3%. Industry static analysis catches almost nothing the formal methods find. The proposed fix reduces the problem. It doesn't solve it. And in a system producing code at the volumes Faros measured, reducing isn't enough.
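A back-of-the-envelope calculation shows what the leak looks like once generation and review are chained. The sketch below uses the two rates quoted from "Broken by Default" and assumes, for simplicity, that the 78.7% detection rate applies evenly to every vulnerable artifact and that false positives are ignored; the function name is hypothetical.

```python
GENERATION_VULN_RATE = 0.558    # share of artifacts with a proven vulnerability
SELF_REVIEW_DETECTION = 0.787   # share of vulnerable outputs flagged in review mode


def unflagged_vulnerable_share(vuln_rate: float, detection_rate: float) -> float:
    """Share of ALL artifacts that are vulnerable and pass AI review unflagged.

    Simplifying assumption: detection is equally likely for every
    vulnerable artifact, so the two rates can simply be multiplied.
    """
    if not (0.0 <= vuln_rate <= 1.0 and 0.0 <= detection_rate <= 1.0):
        raise ValueError("rates must lie in [0, 1]")
    return vuln_rate * (1.0 - detection_rate)


share = unflagged_vulnerable_share(GENERATION_VULN_RATE, SELF_REVIEW_DETECTION)
print(f"Vulnerable and unflagged: {share:.1%} of all artifacts")
print(f"Per 3,500 artifacts: about {3500 * share:.0f}")
```

Under those assumptions, roughly 12% of everything generated still ships with a formally provable vulnerability that the review pass did not flag.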

What's Actually Happening

The most consequential finding in the Faros report isn't any single metric. It's this: engineering maturity is not a shield. Organizations with strong pre-AI engineering practices experienced the same downstream deterioration as everyone else. And as AI adoption deepened within organizations, the downstream metrics got worse, not better. The gap is widening.

This contradicts DORA's 2025 finding that high-performing teams would adapt. The Faros telemetry — real commits, real review times, real incident rates from 22,000 developers — says they didn't.

Lightrun's 2026 report found that 43% of AI-generated code changes need manual debugging in production even after passing QA and staging. Zero percent of the 200 SRE and DevOps leaders surveyed described themselves as "very confident" that AI-generated code would behave correctly once deployed. Not cautiously optimistic. Zero.

The system isn't waiting for AI review to mature. It's adapting the way systems always adapt under pressure: by dropping the expensive step. Review times explode, so teams skip review. Bugs rise, so debugging shifts to production. The fix comes too late and works too poorly to change this trajectory.

The Fix That Requires What's Being Cut

Zietsman's paper doesn't just diagnose the problem. It proposes an architecture: specifications first, deterministic verification pipeline second, AI review only for the structural and architectural residual — the things specifications can't capture.
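The ordering is the point of that architecture, and a minimal sketch can show it. Everything below is a hypothetical illustration, not code from the paper: the `run_quality_gate` function, the `GateResult` type, and the toy checks are invented names, and the AI reviewer is a stub rather than a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

# A check takes source code and returns a list of findings (empty means pass).
Check = Callable[[str], List[str]]


@dataclass
class GateResult:
    stage: str                      # stage that produced the verdict
    passed: bool
    findings: List[str] = field(default_factory=list)


def run_quality_gate(
    code: str,
    spec_checks: Sequence[Check],
    deterministic_checks: Sequence[Check],
    ai_review: Optional[Check] = None,
) -> GateResult:
    """Run stages in order and stop at the first stage that reports findings."""
    stages = (("specification", spec_checks), ("deterministic", deterministic_checks))
    for stage, checks in stages:
        findings = [finding for check in checks for finding in check(code)]
        if findings:
            return GateResult(stage, False, findings)
    # Only code that satisfies the spec and the deterministic tools reaches
    # the AI reviewer, which is left with the residual judgment calls.
    if ai_review is None:
        return GateResult("deterministic", True)
    findings = ai_review(code)
    return GateResult("ai_review", not findings, findings)


# --- Toy checks: all hypothetical stand-ins for real tooling ---

def spec_add_is_correct(code: str) -> List[str]:
    """Executable specification: the code must define add() with add(2, 3) == 5."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # toy only; a real pipeline should sandbox this
        ok = namespace["add"](2, 3) == 5
    except Exception:
        ok = False
    return [] if ok else ["spec: add(2, 3) must equal 5"]


def no_bare_except(code: str) -> List[str]:
    """Crude stand-in for a linter rule."""
    return ["lint: bare 'except:' found"] if "except:" in code else []


def stub_ai_review(code: str) -> List[str]:
    """Placeholder for a model call; it flags nothing in this sketch."""
    return []


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

print(run_quality_gate(good, [spec_add_is_correct], [no_bare_except], stub_ai_review))
print(run_quality_gate(bad, [spec_add_is_correct], [no_bare_except], stub_ai_review))
```

The design choice worth noticing is the early return: code that violates the specification never reaches the AI reviewer at all, so the reviewer is never asked to judge plausibility where a rule can give a definite answer.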

This is almost certainly correct. The IBM data supports it: deterministic tools plus AI review reach 94%, versus 45% for AI alone. The tools that check against rules rather than plausibility are doing the heavy lifting. AI review adds value only when it's checking what formal methods can't: design intent, architectural coherence, the soft judgment layer.

But here's the problem that closes the loop: writing specifications, maintaining deterministic verification pipelines, designing the architecture that AI review checks against — this is exactly the work that's being cut. The same economic pressure that created the verification crisis (too much code, too few reviewers) also eliminates the people who would write the specifications the fix depends on. Faros measured review times quintupling. That pressure doesn't produce more specification writers. It produces more skipped reviews.

Anthropic's postmortem is the parable. They build the state of the art in AI coding tools. They launched an AI code review product. And three deliberate decisions to reduce compute — a reasoning effort downgrade, a caching change, a verbosity cap — degraded their own tool for six weeks while users reported problems and were initially told nothing was wrong. The verification tool couldn't verify itself. Not because the technology failed, but because the same pressures that create the need for verification (speed, cost, throughput) also degrade the capacity to verify.

The fix is circular not because AI can't review code. It can — partially, unreliably, with correlated blind spots. The fix is circular because it tries to solve a verification crisis with more of the technology that created the verification crisis, while the economic forces that created the crisis simultaneously destroy the infrastructure the fix requires.