
Who Reviews the Reviewers? The AI Code Review Arms Race

Here is the situation: AI coding tools now generate 26.9% of production code. That code arrives in pull requests that are 154% larger than they used to be. Review time has increased 91%. And senior engineers now spend 3.6 times longer reviewing AI-generated suggestions than human-written code.

The bottleneck didn't disappear. It moved.

"Reviewing code looks the same as 3 years ago."

— Michael Truell, Cursor CEO (via Fortune)

So naturally, the market responded. AI code review is now a $4 billion market — up from $550 million in one year. At least eight major tools are competing for it. And the core dynamic is almost too neat: AI created the problem. AI is selling the fix.

The Bottleneck, Quantified

Faros AI measured 10,000+ developers across 1,255 teams. The picture is unambiguous.

| Metric | Change | What it means |
| --- | --- | --- |
| PRs merged | +98% | Nearly doubled output |
| PR size | +154% | Each PR is 2.5× larger |
| Review time | +91% | Nearly double per review |
| Bugs found | +9% | More bugs per PR |
| DORA metrics | Flat | No org-level improvement |
| Code bloat | 6.4× | AI REST endpoint: 186 lines vs. human's 29 |

The math is brutal. Twice the PRs, each 2.5× bigger, taking twice as long to review — and the bugs went up, not down. DORA 2025 found the same thing: AI adoption now positively correlates with throughput but still negatively correlates with stability. Speed without stability is accelerated chaos.
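
To see why the totals compound, here is a back-of-the-envelope model. The baseline figures (100 PRs per week, 30 minutes per review) are invented for illustration; only the percentage changes come from the table above.

```python
# Rough model of total review load using the Faros AI deltas above.
# Baseline numbers are made up; only the multipliers come from the article.
prs_per_week_before = 100            # hypothetical baseline
minutes_per_review_before = 30       # hypothetical baseline

prs_per_week_after = prs_per_week_before * 1.98               # PRs merged +98%
minutes_per_review_after = minutes_per_review_before * 1.91   # review time +91%

hours_before = prs_per_week_before * minutes_per_review_before / 60
hours_after = prs_per_week_after * minutes_per_review_after / 60

print(f"Review load before: {hours_before:.0f} h/week")
print(f"Review load after:  {hours_after:.0f} h/week")
print(f"Multiplier: {hours_after / hours_before:.1f}x")       # ~3.8x
```

Roughly a 3.8× increase in total review hours, with no corresponding increase in the number of people doing the reviewing.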

Only 26% of senior engineers would ship AI code without review. The rest know what LogRocket put precisely: "You're not validating correctness. You're judging necessity."

Eight Tools, One Problem

The market response has been explosive. Here are the eight major players, mapped by what they actually bet on.

| Tool | Bet | Key metric | Price | Customers |
| --- | --- | --- | --- | --- |
| Anthropic Code Review | Multi-agent depth | < 1% false positive rate | $15–25/PR | Uber, Salesforce, Accenture |
| BugBot (Cursor) | Scale + autofix | 76% resolution, 35% merged unmodified | $40/user/mo | Rippling, Discord, Samsara |
| Copilot Code Review | Ecosystem lock-in | 60M+ reviews, 1 in 5 on GitHub | Included | 12K+ orgs, WEX, GM |
| Codex Security | Security audit | 11K+ high-severity bugs found | Free 30 days | OSS community |
| CodeRabbit | Flat-rate volume | 2M repos, 13M+ PRs | $24–30/mo | 2M repos connected |
| Greptile | Full codebase context | 82% bug catch rate | Contact | |
| Graphite | Stacked PR workflow | 24hr → 90min merge time | $40/user/mo | Shopify, Asana |
| Qodo | Enterprise context | 450K dev hours saved/year (Fortune 100) | Enterprise | Fortune 100 retailer, monday.com |

Notice the spectrum. On one end, tools like Anthropic Code Review and Greptile bet on depth — fewer reviews, higher fidelity, cross-file analysis. On the other, CodeRabbit and Copilot bet on coverage — review everything, fast, at low cost. In the middle, BugBot and Graphite are betting on workflow — not just finding problems but fixing them automatically.

Each bet reflects a theory about what's actually broken. Is the problem that we're missing bugs? That reviews take too long? That the fixes themselves need automation? The market hasn't decided yet.

Three Benchmarks, Three Different Winners

There's no consensus on who's best — partly because every benchmark tests something different.

The .NET Showdown: 38 issues, 3 tools

A DEV Community benchmark pitted Copilot, Claude, and BugBot against 38 planted issues in a .NET codebase.

  • Copilot: 34/38. First pass. Best on security (18/18).
  • Claude: 38/38. Multi-pass. Added severity tiers + merge recommendations.
  • BugBot: 35/38. Only tool that caught its own regression.

Claude found everything — eventually. But it needed multiple passes. Copilot was best on security out of the box. BugBot's most interesting feature? It caught a bug it introduced in a previous fix. Self-correction is arguably more valuable than raw detection.

The 50-Bug Benchmark: Greptile's home court

Greptile's own benchmark tested five tools against 50 planted bugs. The results were wildly different from the .NET test.

  • Greptile: 82%
  • Cursor: 58%
  • Copilot: ~54%
  • CodeRabbit: 44%
  • Graphite: 6%

Source: Greptile's own benchmark.

Full-codebase context (Greptile's approach) dramatically outperformed diff-only analysis. But this is also Greptile's own benchmark, which is like a restaurant rating itself five stars. The signal is in the gap: 82% vs. 44% suggests that understanding the whole codebase, not just the diff, matters enormously for catching real bugs.

The AIMultiple evaluation: 309 PRs

CodeRabbit scored 4/5 on correctness and actionability but just 1/5 on completeness. Fast and usually right — but it misses a lot. This is the speed-vs-depth tradeoff made visible: you can review every PR quickly, or you can review fewer PRs thoroughly. No tool has cracked both yet.

The Recursive Problem

Here's where it gets uncomfortable.

DryRun Security had three AI agents build two applications from scratch. Then they scanned the results — 38 scans across 30 PRs. The findings:

  • 87% of AI-generated PRs contained at least one vulnerability.
  • 143 total issues across 10 vulnerability classes. Not outliers, but structural risks.
  • Codex left the fewest unresolved issues (8). Claude left the most high-severity ones (13). Gemini created the most issues overall.

"Security isn't part of their default thinking." — James Wickett, DryRun CEO

The implication is recursive: AI generates code with vulnerabilities. AI reviews that code. The reviewers themselves have blind spots. Apiiro measured this at Fortune 50 scale: 4× velocity, 10× vulnerabilities. 322% more privilege escalation paths. 153% more design-level flaws — the kind you can't patch with a quick fix.

And CodeRabbit's own analysis of 470 open-source PRs found AI-co-authored code had 1.7× more issues overall: 1.75× logic bugs, 1.64× maintainability problems, 1.57× security issues. The tools creating the code and the tools reviewing the code share the same fundamental limitation: they optimize for patterns, not understanding.

What Actually Works

Amid the recursive mess, some approaches show genuine results.

The cross-model pattern

WorkOS runs Cursor BugBot as the reviewer on every Claude Code PR, so different models catch each other's blind spots. As they put it: "If you're using Claude Code to ship PRs and reviewing them yourself, you're leaving easy wins on the table." This adversarial setup, with one AI generating and a different AI reviewing, outperforms either model reviewing its own output.
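
The pattern is easy to wire into any pipeline. Below is a minimal sketch of the routing logic; `generate_patch`, `review_patch`, and the model names are hypothetical stubs, not WorkOS's setup or any vendor's actual API.

```python
# Sketch of the cross-model review pattern: the model that authored a patch
# never reviews it. All names and helpers here are hypothetical stubs.
GENERATOR_MODEL = "model-a"   # writes the patch
REVIEWER_MODEL = "model-b"    # a different vendor/model family reviews it

def generate_patch(task: str, model: str) -> str:
    """Stub for the code-generation call."""
    return f"# diff produced by {model} for: {task}\n"

def review_patch(diff: str, model: str) -> list[str]:
    """Stub for the review call; returns a list of findings."""
    return [f"{model}: no obvious issues in {len(diff)} chars of diff"]

def cross_model_review(task: str) -> tuple[str, list[str]]:
    # Key rule: the reviewer must differ from the generator,
    # so shared blind spots don't review themselves.
    assert GENERATOR_MODEL != REVIEWER_MODEL, "reviewer must differ from generator"
    diff = generate_patch(task, model=GENERATOR_MODEL)
    findings = review_patch(diff, model=REVIEWER_MODEL)
    return diff, findings

if __name__ == "__main__":
    _, findings = cross_model_review("add pagination to the users endpoint")
    for finding in findings:
        print(finding)
```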

The layered defense

Snyk argues you need a deterministic SAST layer underneath the LLM review. Known patterns (SQL injection, XSS) should be caught by static analysis rules, not probabilistic models. LLMs should handle novel cross-file bugs that rule-based tools miss. Cursor's own security agents — catching 200+ real vulnerabilities per week across 3,000 internal PRs — work this way: specialized prompt-tuned agents for specific threat models, not general-purpose review.
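
A rough sketch of that layering, assuming a hypothetical `llm_review` call; the deterministic layer here is two toy regexes standing in for a real SAST engine.

```python
import re

# Layer 1: deterministic rules for known vulnerability classes.
# A real deployment would use a SAST engine; these regexes are only illustrative.
SAST_RULES = {
    "possible SQL injection (string-formatted query)": re.compile(r"execute\([^)]*%s"),
    "possible XSS (unescaped innerHTML concat)": re.compile(r"innerHTML\s*=\s*[^;\n]*\+"),
}

def sast_pass(diff: str) -> list[str]:
    """Deterministic layer: cheap, repeatable, no hallucinated findings."""
    return [name for name, rule in SAST_RULES.items() if rule.search(diff)]

def llm_review(diff: str) -> list[str]:
    """Probabilistic layer for novel, cross-file issues (stub; hypothetical API)."""
    return []  # a real setup would call a model here

def layered_review(diff: str) -> list[str]:
    # Known patterns are settled by rules; the LLM budget goes to everything else.
    return sast_pass(diff) + llm_review(diff)

if __name__ == "__main__":
    sample = 'cursor.execute("SELECT * FROM users WHERE id = %s" % user_id)'
    print(layered_review(sample))
```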

The workflow bet

Graphite's data suggests the problem isn't review quality — it's review latency. Stacked PRs cut merge time from 24 hours to 90 minutes. When reviews happen on smaller, incremental changes instead of massive diffs, humans can actually keep up. The problem isn't that human review is too slow — it's that AI-generated PRs are too big.
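
One cheap way to push in that direction, whatever review tool sits behind it, is a PR size gate in CI. A minimal sketch follows; the 400-line threshold is an arbitrary assumption, not a Graphite number.

```python
import subprocess
import sys

MAX_CHANGED_LINES = 400  # arbitrary threshold; tune per team

def changed_lines(base: str = "origin/main...HEAD") -> int:
    """Sum added and deleted lines against the base using git's numstat output."""
    out = subprocess.run(
        ["git", "diff", "--numstat", base],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added.isdigit():     # binary files report "-"
            total += int(added)
        if deleted.isdigit():
            total += int(deleted)
    return total

if __name__ == "__main__":
    n = changed_lines()
    if n > MAX_CHANGED_LINES:
        print(f"PR touches {n} lines (> {MAX_CHANGED_LINES}). Consider splitting it into a stack.")
        sys.exit(1)
    print(f"PR size OK ({n} lines).")
```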

The Real Scorecard

After three benchmarks, eight tools, and a mountain of data, here's what the numbers actually say:

What's improving
  • Anthropic: 16% → 54% of PRs getting substantive comments
  • BugBot: 52% → 76% resolution rate
  • Copilot: 60M+ reviews, 1 in 5 GitHub reviews
  • False positive rates dropping across the board
What isn't
  • 87% of AI PRs still contain vulnerabilities
  • No tool catches > 82% of bugs
  • Design-level flaws up 153% — undetectable by most tools
  • 48% of devs don't verify AI code before committing

The honest assessment: these tools are getting better fast. Anthropic's multi-agent approach and Cursor's autofix are genuine innovations. Copilot's scale — one in five GitHub reviews — is creating a new baseline. But the gap between "pretty good at catching syntax bugs" and "actually understands whether this code should exist" remains enormous.

The deepest problem isn't technical. It's structural. As Anthropic itself noted: "developers are stretched thin, and many PRs get skims rather than deep reads." AI review tools don't fix this. They add another layer of skimming. The code gets generated by AI, reviewed by AI, approved by a developer who glances at the AI's summary. The human in the loop is becoming a rubber stamp.

The tools that will win aren't the ones that find the most bugs. They're the ones that force the question the developer should have asked before writing the prompt: does this code need to exist at all?