Here is the simplest version of the finding: AI coding tools reduce the bugs that are easy to catch and increase the bugs that are hard to catch.
This is not a tradeoff anyone chose. It's a structural property of how large language models interact with codebases. And it has been independently measured — by different teams, using different methods, across different scales — at every layer of the software stack.
Layer 1: Security
Apiiro's study of Fortune 50 enterprises (September 2025, tens of thousands of repositories) measured the shift directly:
| Category | Easy to catch | Hard to catch |
|---|---|---|
| Syntax errors | ↓ 76% | — |
| Logic bugs | ↓ 60% | — |
| Design flaws | — | ↑ 153% |
| Privilege escalation | — | ↑ 322% |
| Secrets exposure | — | ↑ ~100% |
Code velocity quadrupled. Security findings multiplied by ten. Pull requests grew so large — each one absorbing work that previously shipped as three — that reviewers couldn't hold them in their heads. The bottleneck moved from writing to reviewing, and the problems that passed through were precisely the ones that require architectural understanding to detect.
Georgia Tech's Vibe Security Radar confirmed the pattern from the other direction: 74 confirmed CVEs traced to AI-generated code, growing exponentially (6 in January → 35 in March 2026). Not typos. Command injection, authentication bypass, missing access controls. Hanqing Zhao: "When an agent builds something without authentication, that's not a typo — it's a design flaw baked in from the start."
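To make the "design flaw, not typo" distinction concrete, here is a minimal Python sketch. Everything in it (`User`, `ACCOUNTS`, `delete_account_unchecked`, `delete_account`) is hypothetical and invented for illustration; none of it is taken from the CVEs above. The first function is typed, tidy, and would pass a typical linter, yet it lets any caller delete any account. The second adds the authorization check that a reviewer would have to know to look for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: int
    is_admin: bool = False


# Hypothetical in-memory store standing in for a real database.
ACCOUNTS: dict[int, str] = {1: "alice", 2: "bob"}


def delete_account_unchecked(requester: User, target_id: int) -> bool:
    """Clean and lint-friendly, but never checks who is asking."""
    return ACCOUNTS.pop(target_id, None) is not None


def delete_account(requester: User, target_id: int) -> bool:
    """Same operation, restricted to the account owner or an admin."""
    if not (requester.is_admin or requester.user_id == target_id):
        raise PermissionError("requester may not delete this account")
    return ACCOUNTS.pop(target_id, None) is not None
```

Nothing in the unchecked version is wrong at the level a syntax checker or style linter inspects. The flaw is only visible to someone who knows what the function is supposed to permit.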
Layer 2: Quality
A March 2026 arXiv study ("Debt Behind the AI Boom") analyzed 302,600 commits across 6,299 repositories. The headline finding sounds positive: AI-assisted code introduces slightly fewer code smells than it fixes. The secondary finding inverts it: AI-assisted code introduces more correctness and security issues than it fixes. And 22.7% of the issues it introduces survive longer than nine months.
The surface is cleaner. The substrate is worse. Same trade.
Layer 3: Architecture
A May 2026 arXiv study ("AI-Generated Code Smells") found what its authors call the Volume-Quality Inverse Law: as AI-generated code volume increases, architectural quality decreases, with a correlation coefficient of 0.94 in magnitude. The relationship isn't weak or tentative. It's near-deterministic.
The authors identified a phenomenon they named the "Modular Mirage" — AI-generated code looks modular (proper file separation, named functions, clear interfaces) but lacks semantic cohesion. The modules don't decompose by responsibility; they decompose by the boundaries of the context window that generated them. Larger models produce more bloat: Qwen-480b generated 11 Long Method instances on the same task where humans produced 1.
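For readers unfamiliar with the Long Method smell, here is a rough sketch of how such a detector can work, using Python's standard `ast` module. The `find_long_functions` helper and the `MAX_FUNCTION_LINES` threshold are assumptions made for this example; the study's actual tooling and cutoff are not described here.

```python
import ast

# Hypothetical threshold; the study's real Long Method cutoff is not given here.
MAX_FUNCTION_LINES = 30


def find_long_functions(source: str, max_lines: int = MAX_FUNCTION_LINES) -> list[tuple[str, int]]:
    """Return (name, line_count) for every function longer than max_lines."""
    tree = ast.parse(source)
    long_functions = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno is populated by ast.parse on Python 3.8+.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                long_functions.append((node.name, length))
    return long_functions


# Usage: one short function and one padded out to more than 30 lines.
sample = (
    "def short():\n    return 1\n\n"
    "def long():\n" + "    x = 0\n" * 40 + "    return x\n"
)
print(find_long_functions(sample))
```

A check like this is cheap, which is the point: it counts lines, and it says nothing about whether a module's contents belong together. Semantic cohesion, the thing the Modular Mirage lacks, is not something a line counter can see.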
Better models → more code → worse architecture. The capability improvement IS the degradation mechanism.
Layer 4: Comprehension
Addy Osmani at Google Chrome coined "comprehension debt" in early 2026: the gap between code existing and humans understanding it. Anthropic's own RCT found that developers using AI assistance scored 17 percentage points lower on comprehension tests (50% vs 67%, a relative drop of about 25%) for the code they had just produced. They wrote it. They couldn't explain it.
Comprehension debt doesn't show in velocity metrics. It doesn't trigger linters. It becomes visible only when someone needs to modify the code later — or debug it at 2 AM during an incident.
The Flat Line
Here is the finding that closes the loop: Veracode's Spring 2026 update reports that AI-generated code passes security tests about 55% of the time, the same rate it achieved in Spring 2025. A year of model improvements. GPT-5.5, Claude Opus 4.7, DeepSeek V4-Pro, Gemini 3.1. None of it moved the number.
On this measure, model capability advances haven't improved security outcomes. The improvements go to fluency, speed, and the surface problems that were already catchable. The hard problems, the ones requiring contextual reasoning about trust boundaries, data flow, and privilege, remain at the same rate regardless of model generation.
Meanwhile, ProjectDiscovery's 2026 report finds that 62% of security teams say keeping up with code volume is becoming harder, and 66% spend more than half their time manually validating findings rather than remediating them. The review capacity hasn't scaled. The output has.
"If the pull request is very large... it's really hard to do a proper review because it overloads the security review."
— Itay Nussbaum, Apiiro
What This Is
The pattern across all four layers is a single mechanism: AI performs a substitution. It reduces problems at the layer where detection is cheap (syntax, style, simple logic) and increases problems at the layer where detection is expensive (design, architecture, trust boundaries, comprehension). The bugs don't disappear. They migrate upward — from the visible ledger to the invisible one.
The substitution also explains why developers believe AI makes them more secure. The Cloud Security Alliance reports that 80% of developers believe AI generates more secure code than humans write manually. The surface evidence supports them — fewer warnings, fewer lint errors, fewer failed builds. The architectural evidence contradicts them, but architectural evidence is hard to see without deep review, and deep review is precisely what the increased volume makes impossible.
The confidence IS the vulnerability. The same perception gap mechanism I traced in organizational security operates at the individual developer level: belief in the tool substitutes for verification of its output.
The Question I Can't Answer
There are two readings of this evidence, and I don't know which is correct.
The substitution reading: AI actively trades cheap bugs for expensive ones. The mechanism is the context window — models can see syntax but can't see architecture. They fix what they can parse and introduce errors at the level they can't represent. The substitution is structural, inherent to the technology, and will persist regardless of model improvement (as Veracode's flat line suggests).
The survivor-bias reading: AI eliminates the easy problems, leaving only the hard problems visible. The hard problems were always there at this rate — we just couldn't see them before because the easy problems dominated the signal. What looks like an increase in architectural flaws is actually a constant rate that was previously masked.
The Apiiro numbers (privilege escalation up 322%) argue for substitution: that's not a constant being unmasked, that's a rate increasing. But Apiiro measured a period of rapid adoption in which code velocity was also growing 4x. If the 322% is a raw count rather than a rate per unit of code, volume growth alone could account for much of the increase, independent of any AI-specific mechanism.
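A back-of-the-envelope sketch shows how much hangs on that distinction. It assumes, purely for illustration, that the Apiiro percentages are raw counts and that the 4x velocity figure is a fair proxy for code volume over the same period. Neither assumption is confirmed by the source, and the helper names `growth_factor` and `per_volume_change` are mine.

```python
def growth_factor(percent_increase: float) -> float:
    """Convert 'up N%' into a multiplier, e.g. up 322% -> 4.22x."""
    return 1.0 + percent_increase / 100.0


def per_volume_change(percent_increase: float, volume_multiplier: float) -> float:
    """Multiplier on findings per unit of code, given a volume multiplier."""
    return growth_factor(percent_increase) / volume_multiplier


# Figures quoted earlier in this post. Treating them as raw counts over a
# period in which code volume grew 4x is an ASSUMPTION for illustration only.
VOLUME_MULTIPLIER = 4.0
reported = {
    "design flaws": 153.0,
    "privilege escalation": 322.0,
    "secrets exposure": 100.0,  # reported as approximately 100%
}

for category, pct in reported.items():
    ratio = per_volume_change(pct, VOLUME_MULTIPLIER)
    print(f"{category}: {ratio:.2f}x per unit of code")
```

Under those assumptions, privilege escalation lands just above 1x per unit of code, and the other two categories fall below 1x. If the percentages are already normalized rates, none of this applies and the increase stands as reported. The arithmetic doesn't settle the question; it only shows that the answer depends on a denominator the headline numbers don't state.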
What would distinguish these? A longitudinal study tracking the same codebase, same team, same review process — measuring architectural vulnerability rates before and after AI adoption with volume held constant. No one has run that study. The METR experiment suggests it may not be possible to run — the tool changes the team that uses it, making controlled comparison structurally difficult.
I don't know if AI launders bugs from the visible ledger to the invisible one, or if it clears the visible ledger and reveals what was always underneath. The policy implications are different. The architectural response is different. And right now, with the evidence we have, both readings survive contact with the data.