61% Correct, 10.5% Secure: The Verification Gap That Defines AI Coding in 2026

In January 2026, Georgia Tech's Systems Software & Security Lab found six CVEs in AI-generated code. In February, fifteen. In March, thirty-five.

That's not linear growth. That's the beginning of an exponential curve — and we're at the bottom of it.

The SSLab's Vibe Security Radar project has been methodically tracking vulnerabilities introduced by AI coding tools since May 2025. Their methodology: pull from public vulnerability databases (CVE.org, NVD, GHSA, OSV, RustSec), trace each fix back to its introducing commit, check for AI tool signatures — co-author tags, bot emails — then use AI agents to verify whether the tool actually caused the bug.
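The signature-checking step of that pipeline can be sketched in a few lines. The patterns below are illustrative assumptions about what such fingerprints typically look like — the co-author names and bot email addresses are not the SSLab's actual ruleset:

```python
import re

# Illustrative AI-tool fingerprints. The trailer names and bot emails here
# are assumptions, not the SSLab's actual signature list.
AI_SIGNATURES = [
    re.compile(r"^co-authored-by:.*\b(claude|copilot|devin|jules)\b", re.I | re.M),
    re.compile(r"noreply@anthropic\.com|copilot@github\.com", re.I),
]

def looks_ai_authored(commit_message: str, author_email: str = "") -> bool:
    """True if a commit message or author email carries an AI-tool signature."""
    haystack = f"{commit_message}\n{author_email}"
    return any(pattern.search(haystack) for pattern in AI_SIGNATURES)
```

In practice you would run this over the introducing commit of each fix (e.g. via `git log --format='%B%n%ae'`), then, as the SSLab does, have an agent confirm the tool actually caused the bug rather than merely touching the file.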

Seventy-four confirmed CVEs so far. Fourteen critical. And researcher Hanqing Zhao says the real number is 5 to 10 times higher — roughly 400 to 700 cases across the open-source ecosystem — because most developers strip AI traces before committing.

The Scale of What We're Generating

Claude Code alone has generated 30.7 billion lines of code in the past 90 days and accounts for more than 4% of all public GitHub commits — over 15 million total. That's one tool. Add Copilot, Cursor, Devin, Jules, and the dozens of others, and the picture is clear: AI-generated code is no longer an experiment. It's a substantial fraction of the global codebase.

The generation layer works. Nobody disputes this. PRs per author are up 20% year-over-year. Deployment frequency is climbing everywhere. The tools are fast, getting faster, and increasingly autonomous.

The question was never whether AI could write code. The question is whether anyone is checking it.

The Evidence: Without Review

+23.5%   Incidents per PR, YoY (Cortex 2026 Benchmark)
+30%     Change failure rate, YoY (Cortex 2026 Benchmark)
1.7x     More issues in AI-authored PRs (State of AI vs. Human Code, 470 PRs)
+39%     Cognitive complexity increase (agent-assisted repos after 6 months)

These aren't cherry-picked stats from a single study. Cortex surveyed 50+ engineering leaders and found PRs going up while incidents climbed faster. A separate analysis of 470 pull requests (320 AI-co-authored, 150 human-authored) found AI code carries 1.75x more correctness errors, 1.64x more maintainability issues, and 1.57x more security problems. In repos with heavy agent use, static analysis warnings rose 18%, cognitive complexity jumped 39%, and initial velocity gains disappeared after six months as tech debt accumulated 30-41% faster.

Then there's the security data that should make every engineering leader pause:

322% more privilege escalation paths. 153% more design flaws. AI-generated code often passes review because it looks correct. The bugs aren't typos — they're architectural.

The SusVibes Result

The number that crystallizes the problem comes from a benchmark called SusVibes, built from 200 real-world feature requests that previously led human programmers to write vulnerable code. When SWE-Agent with Claude Sonnet tackled these tasks: 61% of solutions were functionally correct. Only 10.5% were secure.

Read that ratio again. Six out of ten solutions work. One in ten is safe to deploy.

This isn't a model problem. Every agent tested performed poorly on security — even when researchers fed vulnerability hints directly into the prompts. The models can write functional code. They cannot reliably reason about the security implications of what they write. The failure mode isn't "broken code." It's code that works perfectly until someone exploits it.

The CVE Leaderboard Nobody Wanted

Tool                        Confirmed CVEs   Critical   Note
Claude Code                 49               11         4%+ of public commits, always leaves signatures
GitHub Copilot              15               2          4.7M subscribers, weaker traceability
Google Jules                2                1          Newer, lower volume
Devin / Aether / Cursor     2 each           0          Agent tools, limited traceability
Atlassian Rovo / Roo Code   1 each           0          Enterprise/community tools

Source: Vibe Security Radar, Georgia Tech SSLab. Data as of March 20, 2026. Note: Claude Code's dominance reflects market share and signature persistence, not necessarily higher per-commit vulnerability rate.

The important caveat: Claude Code always leaves co-author metadata. Most other tools don't, or developers remove it. Claude Code's high count reflects the fact that it's the most traceable, not necessarily the most dangerous per line of code. Zhao estimates the true vulnerability count across all tools is 400-700 — the 74 confirmed are just the ones that left fingerprints.

The Evidence: With Review

Here's where the story turns. Because we also have data on what happens when teams invest in the verification layer.

Qodo's survey data: 81% of teams using AI code review report quality improvements, versus 55% without. Teams with AI review see 2x the quality gains (36% vs 17%) — even without any speed improvement.

GitHub's Octoverse report: repositories with AI-assisted review had 32% faster merge times and 28% fewer post-merge defects.

In March 2026, a research lab called Martian released the Code Review Bench — the first independent benchmark of AI code review tools, built by former DeepMind, Anthropic, and Meta researchers with no product to sell. They analyzed 200,000+ real pull requests across 17 tools, measuring not what the tools flagged but what developers actually acted on.

Tool                   F1      Precision   Recall
Qodo Extended          64.3%   62.3%       66.4%
Augment                53.8%   47.0%       62.8%
CodeAnt AI             51.7%   52.2%       51.1%
Cursor Bugbot          44.9%   46.2%       43.8%
Claude Code Reviewer   39.0%
GitHub Copilot         35.5%

Source: Martian Code Review Bench, March 2026. F1 balances precision (comment accuracy) and recall (issue coverage). Higher is better. Caveat: offline test set was partly constructed using Augment and Greptile data, introducing potential structural bias.
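F1 is just the harmonic mean of precision and recall, so the table's first column can be reproduced from the other two (to within rounding of the published figures):

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Qodo Extended: 62.3% precision, 66.4% recall -> 64.3% F1
assert round(f1(0.623, 0.664), 3) == 0.643
# Augment: 47.0% precision, 62.8% recall -> 53.8% F1
assert round(f1(0.470, 0.628), 3) == 0.538
```

The harmonic mean punishes imbalance: a tool that floods PRs with comments (high recall, low precision) or stays safely quiet (the reverse) scores worse than one that does both moderately well.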

The best tool catches two-thirds of real issues. The worst catches a third. None are perfect. But the difference between having automated review and not having it is stark: 28% fewer post-merge defects and twice the quality gains, set against the rising incident and failure curves of the teams that skip it.

The Governance Gap

Here's what ties it together. Cortex found that nearly 90% of engineering teams actively use AI coding tools. Of those:

32%   Formal policies with enforcement
41%   Informal guidelines only
27%   No governance at all

Two-thirds of teams using AI coding tools have no formal, enforced policies around code quality governance. The generation tools are sophisticated, fast, and everywhere. The verification layer is absent or improvised.

This is the gap. This is what the numbers are screaming about.

The NCSC Weighs In

On March 24, the UK's National Cyber Security Centre CEO Dr. Richard Horne stood at the RSA Conference and called for "vibe coding safeguards." The accompanying NCSC blog post was blunt: AI-generated code currently presents "intolerable risks" for many organizations.

Their prediction: within five years, AI-written code in production systems that a human has never reviewed or even looked at will be common. Their recommendation: secure-by-default practices baked into AI models, provable model provenance, and — critically — AI tools auditing all code, not just human-written code.
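That last recommendation — audit everything, regardless of author — amounts to a merge gate that every change passes through. A minimal sketch, where the hypothetical `run_security_review` stands in for whatever scanner or AI review tool a team actually runs:

```python
def run_security_review(diff: str) -> list[str]:
    """Placeholder reviewer: flags a few obviously dangerous patterns.
    A real gate would invoke a scanner or AI review tool here."""
    risky_patterns = ("eval(", "pickle.loads(", "verify=False")
    return [f"found '{p}' in change" for p in risky_patterns if p in diff]

def merge_gate(diff: str) -> bool:
    """Block the merge if review raises any findings -- applied uniformly,
    whether the diff came from a human or an AI tool."""
    return not run_security_review(diff)
```

The point of the sketch is the uniformity: the gate never asks who wrote the diff, which is exactly what the NCSC is asking for.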

When a national cyber security agency starts issuing guidance on your development practices, the problem is no longer theoretical.

Cheap to Generate, Expensive to Validate

Every data point in this piece points at the same structural dynamic. I've been calling it the verification gap, but Cortex named it better: cheap to generate, expensive to validate.

This is the defining economic reality of AI-assisted development in 2026. Generation costs are plummeting — models are faster, cheaper, more autonomous every quarter. Verification costs are not. If anything, they're rising, because the volume and subtlety of AI-generated bugs require increasingly sophisticated review.

The companies getting this right invest in the review layer as aggressively as they invest in the generation layer. They treat AI code review not as an optional nice-to-have but as essential infrastructure — the way CI/CD became essential infrastructure a decade ago. The data says those companies get 2x the quality gains and 28% fewer post-merge defects.

The companies getting this wrong are shipping faster, breaking more, and accumulating a technical debt burden that the six-month data already shows erasing their velocity gains. Their incident rates are climbing. Their change failure rates are climbing. And somewhere in their codebase, there are vulnerabilities that functional tests will never catch because the code works — it just isn't safe.

Sixty-one percent correct. Ten point five percent secure. That ratio is the whole story.