Your AI coding agent has a half-life.
Toby Ord, the Oxford philosopher best known for existential risk research, spent the last year studying something more immediate: how long AI agents can work before they break. His analysis of 170 METR benchmark tasks found that agent success follows exponential decay — like radioactive isotopes, each agent has a characteristic half-life, a fixed interval after which its probability of still succeeding drops by half.
Claude 3.7 Sonnet's half-life: roughly 59 minutes. A one-hour task succeeds about half the time. A two-hour task: 25%. Four hours: 6.25%. Eight hours: less than one percent. The math is merciless and the curve is universal — every model tested followed the same shape, just with different decay constants.
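To make the arithmetic concrete, here is a minimal sketch of that decay model, assuming the success probability halves with each additional half-life of task length. The 59-minute constant comes from the figures above; the function name and the exact outputs are illustrative.

```python
# Success probability under a fixed half-life, as in Ord's decay model.
# Illustrative sketch: assumes P(success) halves for every additional
# half-life of required working time.
HALF_LIFE_MIN = 59  # Claude 3.7 Sonnet's estimated half-life, in minutes

def success_probability(task_minutes: float, half_life: float = HALF_LIFE_MIN) -> float:
    """P(success) = 0.5 ** (task length / half-life)."""
    return 0.5 ** (task_minutes / half_life)

for minutes in (5, 60, 120, 240, 480):
    print(f"{minutes:>3}-minute task: {success_probability(minutes):.1%}")
# ~94%, ~49%, ~24%, ~6%, ~0.4%
```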
This isn't a bug. It's the physics of compounding failure applied to systems that can't perfectly self-correct. And it explains the most important number in AI-assisted software engineering right now.
The Number
LinearB analyzed 8.1 million pull requests across 4,800 engineering teams in their 2026 Software Engineering Benchmarks Report. The finding: AI-generated PRs are accepted at a rate of 32.7%. Human-written PRs: 84.4%.
That's a 67.3% rejection rate for AI code in production.
The secondary findings are equally telling. AI-generated PRs wait 4.6 times longer before anyone picks them up for review. Once someone does start reviewing, they finish 2x faster than with human code — not because the code is cleaner, but because reviewers spend less effort on code they don't trust. The pattern is clear: teams are generating more code than ever and trusting less of it than ever.
These aren't toy benchmarks. This is 8.1 million real pull requests at real companies shipping real products. Two out of three AI-written PRs get rejected. The question is why — and Ord's half-life model gives us the framework to answer it.
The Math That Explains Everything
Consider a modest AI coding task: twenty steps involving reading files, understanding context, writing code, running tests, and fixing errors. Assume each step succeeds with 95% probability — generous for current models. The probability of completing all twenty steps correctly:
0.95^20 ≈ 0.358
A 36% success rate. Sound familiar?
This is the compounding error problem. At 95% per-step accuracy over 20 steps, you land within a few points of LinearB's 32.7% acceptance rate. The math isn't a coincidence; it's the mechanism.
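A three-line sketch makes the lever visible: hold the per-step rate fixed and vary the chain length, or hold the length fixed and vary the rate. Independent steps are an assumption here; real failures correlate.

```python
# Probability that every step in a chain succeeds, assuming each step
# is independent with the same success rate (a simplification).
def chain_success(per_step: float, steps: int) -> float:
    return per_step ** steps

print(f"{chain_success(0.95, 20):.1%}")  # ~35.8%: 20 steps at 95% each
print(f"{chain_success(0.99, 20):.1%}")  # ~81.8%: same chain, better per-step accuracy
print(f"{chain_success(0.95, 5):.1%}")   # ~77.4%: same accuracy, shorter chain
```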
The shape of that decay curve explains why some organizations succeed with AI agents while most don't. It's not about model quality; it's about where on the curve your tasks sit. Stripe keeps tasks in the first few steps, where success probability is still high. The average company throws broad, multi-step tasks at agents and lands in the decay zone. Amazon, before its 90-day safety reset, was operating at the far right of the curve: long chains of unsupervised agent actions with no governance.
The APEX-Agents benchmark tells the same story from a different angle. Across 480 realistic professional tasks requiring multi-step reasoning, the best frontier models achieved just 24% first-attempt success. Running eight attempts raised the best results to about 40% — which means that even with aggressive retry strategies, most complex tasks remain unsolved.
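The retry numbers carry their own lesson. If each attempt were an independent 24% draw, eight tries would clear most tasks; a quick calculation shows why landing at only 40% implies the failures are correlated rather than random. The independence assumption below is mine, not the benchmark's.

```python
# Expected solve rate over eight attempts if attempts were independent.
# The benchmark's observed ~40% is far below this, which suggests the
# unsolved tasks fail the same way every time.
p_single = 0.24
attempts = 8
p_any_success = 1 - (1 - p_single) ** attempts
print(f"{p_any_success:.1%}")  # ~88.9% under independence, vs ~40% observed
```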
Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. Only about 130 of the thousands of "agentic AI" vendors are real, they estimate — the rest are "agent washing," rebranding chatbots and RPA tools.
Who Beat the Math
Three organizations stand out for having genuine, verified success with AI agents in production. What they share is instructive.
Stripe ships 1,300 AI-generated pull requests per week through their internal Minions system — a fork of Goose running on Claude. Zero human-written code in those PRs. That number sounds like it contradicts the 67% rejection story, until you understand how they built it.
Stripe's architecture treats the agent as one node in a larger deterministic system. Each "blueprint" workflow combines hard-coded steps (clone repo, set up environment, run specific tests) with agentic steps (write the code, interpret error messages). The agent never operates in open-ended mode. It gets a narrow, well-specified task with a devbox environment spun up in 10 seconds, access to over 500 tools via MCP, and a two-retry maximum before escalation. Three tiers of feedback — linting, 3 million+ automated tests, and human review — catch errors before they compound.
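In outline, the pattern looks something like the sketch below. This is a hypothetical reconstruction of the blueprint idea as described, not Stripe's code; the class and function names are invented, and the checks stand in for the linting, test, and review tiers.

```python
# Hypothetical sketch of a "blueprint" workflow: deterministic steps
# bracket one narrow agentic step, with a hard retry cap and escalation
# to a human. Names and checks are illustrative, not Stripe's code.
from dataclasses import dataclass
from typing import Callable, Optional

MAX_AGENT_RETRIES = 2  # mirrors the two-retry maximum before escalation

@dataclass
class Blueprint:
    setup_steps: list[Callable[[], None]]   # deterministic: clone repo, provision devbox
    agent_step: Callable[[], str]           # agentic: produce a candidate patch
    checks: list[Callable[[str], bool]]     # deterministic: lint, tests, review gate

    def run(self) -> Optional[str]:
        for step in self.setup_steps:
            step()                           # hard-coded, no model involved
        for _ in range(1 + MAX_AGENT_RETRIES):
            patch = self.agent_step()
            if all(check(patch) for check in self.checks):
                return patch                 # every feedback tier passed; open the PR
        return None                          # out of retries: escalate to a human

# Toy usage with placeholder steps
bp = Blueprint(
    setup_steps=[lambda: print("clone repo"), lambda: print("provision devbox")],
    agent_step=lambda: "diff --git a/app.py b/app.py",
    checks=[lambda patch: patch.startswith("diff"), lambda patch: bool(patch.strip())],
)
print("escalated to human" if bp.run() is None else "PR created")
```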
"The primary reason the Minions work has almost nothing to do with the AI model. It has everything to do with the infrastructure Stripe has built for developers over the years."
— Alistair Gray, Stripe Engineering, February 2026
In half-life terms: Stripe moved each agent task from the right side of the decay curve to the left. Smaller tasks, more checkpoints, faster failure detection. They didn't beat the math — they changed the inputs.
Nubank reported 8-12x efficiency gains and 20x cost savings on a 6-million-line ETL migration. Those numbers are verified and real. They're also narrow. ETL migrations are monotonous, highly repetitive transformations with clear input-output contracts — the ideal task shape for agents. The same approach would not generalize to feature development or debugging novel issues.
Rakuten achieved 99.9% accuracy across 12.5 million lines of code transformation, again on well-defined, rule-governed migrations. The pattern: verified success with agents tracks the product of task narrowness and infrastructure depth, not model quality.
Who Didn't
Amazon's AI coding deployment resulted in 6.3 million lost orders and triggered a 90-day safety reset across the company. The failure wasn't in the model — it was in deploying agents without governance structures, without staged rollouts, without the kind of infrastructure that Stripe spent years building for human developers.
Amazon's VP of engineering, James Gosling, called the ROI calculus "disastrously shortsighted." The company had optimized for velocity — how fast can agents ship code — without investing in the systems that make velocity safe.
The average company sits somewhere between Amazon's catastrophe and Stripe's success. Opsera's data shows AI code requiring 15-25% rework, eating 30-40% of the productivity gains. Aikido Security found 69% of organizations discovered AI-introduced vulnerabilities, with one in five causing material business impact. Trust is declining: 50% of developers now actively distrust AI outputs, up from 31% in 2024.
Meanwhile, the review bottleneck grows. Anthropic's own data shows code output up 200% since agents were deployed internally — but that output still needs human review. The constraint has shifted from code generation to code evaluation, which is why Anthropic launched Code Review as a product and why HubSpot built "judge agents" to evaluate other agents' work. The production problem isn't generating code. It's knowing which code to trust.
From a broader economic lens: an NBER study from February 2026 surveying thousands of firms found that 89% report zero measurable productivity change from AI adoption. The gains are real at the individual task level, but they're not yet showing up in firm-level metrics. The gap between individual acceleration and organizational output is the governance gap.
The Infrastructure Thesis
Every data point in this article converges on the same conclusion: the difference between AI agents that work in production and AI agents that don't has almost nothing to do with the model.
Stripe's Minions run on Claude — the same foundation model available to every company. Factory.ai's Droids are model-agnostic. Nubank's 12x gains came from task design, not model selection. The SWE-bench leaderboard shows the top five models within 0.9 percentage points of each other. The model is a commodity. The infrastructure is the product.
What "infrastructure" means concretely:
This is the uncomfortable truth the industry doesn't want to hear. You can't buy your way to agent-driven development with a better model or a more expensive tool. The companies succeeding with agents were already the best engineering organizations. AI didn't create their excellence — it amplified what was already there. And for everyone else, AI amplified their deficiencies with equal enthusiasm.
NIST launched its AI Agent Standards Initiative in February 2026, focusing on agent identity, authentication, and security protocols. But only 14.4% of organizations deploy agents with full security approval. The 40% of deployed agents running with zero safety monitoring, per the MIT AI Agent Index, are operating at the far right of the decay curve with no checkpoints at all.
The Half-Life Is the Message
Toby Ord's half-life concept reframes the entire AI agent conversation. The question isn't "how good is the model" — it's "how long does each task take, and how many checkpoints exist along the way."
A 59-minute half-life means a system that succeeds half the time on hour-long tasks. That sounds bad until you realize that a 5-minute task, under the same model, succeeds 94% of the time. The half-life is a constant for any given system. What changes is the architecture around it — how you decompose tasks, how often you verify, how quickly you recover from failures.
Stripe didn't beat the exponential decay with a better model. They beat it by making each step smaller, each failure cheaper, each recovery faster. The 67.3% rejection rate isn't an AI problem. It's an infrastructure problem that AI made impossible to ignore.
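A back-of-the-envelope version of that claim, under the same 59-minute half-life and an assumed two-retry budget per step. The numbers are illustrative, not Stripe's.

```python
# Decomposition alone does not change the math: twelve 5-minute steps
# compound to roughly the same odds as one 60-minute run. Cheap,
# reliable failure detection plus retries at each checkpoint does.
# Assumes independent attempts and perfect checkpoints; illustrative only.
HALF_LIFE_MIN = 59

def p_step(minutes: float) -> float:
    return 0.5 ** (minutes / HALF_LIFE_MIN)

one_shot = p_step(60)                               # ~49%: one long run
twelve_steps = p_step(5) ** 12                      # ~49%: smaller steps, no recovery
with_retries = (1 - (1 - p_step(5)) ** 3) ** 12     # ~99.8%: two retries per step
print(f"{one_shot:.1%}  {twelve_steps:.1%}  {with_retries:.1%}")
```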
The companies that will win the next phase of AI-assisted development aren't the ones with the best models or the most aggressive deployment strategies. They're the ones with the best plumbing — the test suites, the CI/CD pipelines, the code review processes, the governance structures that were built for humans and now serve as the scaffolding that keeps agents from falling off the curve.
The half-life will improve. Models will get more reliable. But the decay curve never goes away — it just shifts. And the organizations that invested in infrastructure will capture that improvement. The ones that didn't will discover, again, that faster failure is still failure.