
Where AI Coding Actually Works

Everyone asks the same question: Which AI coding tool should we use?

The data says it barely matters.

Stripe ships 1,300 pull requests per week with zero human-written code. Intercom built a custom model that beats GPT-5.4 at a fifth of the cost. Some companies using AI see a 50% drop in customer-facing incidents. Others see a 2x increase. Same tools. Same models. Different outcomes.

After four weeks of covering what goes wrong with AI coding — the verification gap, the reliability tax, the productivity paradox — I went looking for the positive case. What I found doesn't contradict the negative case. It explains it.

Three Companies That Got It Right

Stripe: 1,300 PRs a Week, Zero Human-Written Code

Stripe's Minions are autonomous coding agents that take a task description — from Slack, a bug report, a feature request — and produce a complete pull request: code, tests, documentation. Every PR is human-reviewed. None are human-written.

The architecture is what makes it work. Stripe calls it "blueprints" — a hybrid of deterministic nodes (git operations, linting, CI) and agentic nodes (implementation, bug fixing). Each Minion runs in a devbox — a standardized EC2 instance with Stripe's full source tree, warmed Bazel caches, and type-checking services, provisioned from a warm pool in under 10 seconds. The agent harness is a heavily modified fork of Block's open-source Goose, stripped of everything interactive and optimized for one-shot completion.
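
Stripe hasn't published the blueprint code, so treat the following as a minimal sketch of the idea rather than their implementation; every identifier in it is invented. The point it illustrates is the hybrid: agentic nodes sit between deterministic nodes whose outcomes are machine-checkable.

```python
from dataclasses import dataclass
from typing import Callable

State = dict  # shared pipeline state handed from node to node

@dataclass
class Node:
    name: str
    run: Callable[[State], State]
    deterministic: bool  # True for git/lint/CI steps, False for LLM-driven steps

# Stand-in node implementations; a real system would call git, Bazel, and a model.
def checkout(state: State) -> State:
    state["workdir"] = f"/devbox/{state['repo']}"      # deterministic setup
    return state

def implement(state: State) -> State:
    state["patch"] = f"patch for: {state['task']}"     # agentic: the model writes code
    return state

def run_ci(state: State) -> State:
    state["ci_passed"] = bool(state["patch"])          # deterministic pass/fail verdict
    return state

BLUEPRINT = [
    Node("checkout", checkout, deterministic=True),
    Node("implement", implement, deterministic=False),
    Node("ci", run_ci, deterministic=True),
]

def execute(blueprint: list[Node], state: State) -> State:
    for node in blueprint:
        state = node.run(state)
    return state

print(execute(BLUEPRINT, {"repo": "payments", "task": "bump a dependency"}))
```

In this shape, the model's contribution is bracketed by steps that either pass or fail; nothing downstream has to take the agent's word for it.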

The agent connects to Toolshed, Stripe's centralized MCP server, which exposes nearly 500 tools spanning internal systems and third-party platforms. Different agents request only task-relevant subsets rather than loading the full catalog. Directory-scoped rule files attach automatically as the agent traverses the filesystem — and in a clever interoperability move, Stripe adopted Cursor's rule file format and synchronized it across three separate agent systems: Minions, Cursor, and Claude Code.
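
Toolshed itself isn't public, but both moves, scoping the tool catalog per task and attaching rules by directory, are easy to sketch. The registry, the tags, and the AGENT_RULES.md filename below are invented for illustration; Stripe's actual rule files use Cursor's format.

```python
from pathlib import Path

# Hypothetical tool registry: hundreds of tools, each tagged by capability.
TOOL_REGISTRY = {
    "tracker.create_ticket": {"tags": {"tracking"}},
    "ci.rerun_job":          {"tags": {"ci"}},
    "db.run_migration":      {"tags": {"database"}},
    "docs.search":           {"tags": {"docs", "search"}},
}

def tools_for_task(needed: set[str]) -> dict:
    """Expose only tools whose tags match the task, never the full catalog."""
    return {name: spec for name, spec in TOOL_REGISTRY.items() if spec["tags"] & needed}

def rules_for_path(path: Path, repo_root: Path) -> list[str]:
    """Collect directory-scoped rule files from `path` up to the repo root, so
    guidance attaches automatically as the agent traverses the filesystem."""
    rules = []
    for directory in [path, *path.parents]:
        rule_file = directory / "AGENT_RULES.md"   # placeholder filename
        if rule_file.exists():
            rules.append(rule_file.read_text())
        if directory == repo_root:
            break
    return rules[::-1]  # repo-wide rules first, most specific directory last

flaky_test_tools = tools_for_task({"ci"})               # a flaky-test agent needs CI tools only
api_rules = rules_for_path(Path("payments/api"), Path("."))
```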

A two-retry limit on CI failures prevents infinite loops. Two shots and a human handoff is the sweet spot. As Stripe's engineers put it: "there are diminishing marginal returns if an LLM is running against indefinitely many rounds of a full CI loop."
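
A hedged sketch of that policy, with placeholder stubs standing in for the real agent, CI runner, and handoff plumbing (none of these names are Stripe's):

```python
# Stand-ins for the real agent call, CI runner, and handoff machinery.
def propose_patch(task, feedback):  return f"patch({task}, addressing={feedback})"
def run_ci(patch):                  return None if "test_checkout" in patch else ["test_checkout failed"]
def open_pull_request(patch):       return f"PR opened for review: {patch}"
def escalate_to_human(task, patch, failures):  return f"handoff to engineer: {task}, still failing {failures}"

MAX_ATTEMPTS = 2  # two shots at green CI, then a human takes over

def minion_loop(task: str) -> str:
    feedback, patch = None, ""
    for _ in range(MAX_ATTEMPTS):
        patch = propose_patch(task, feedback)   # agentic: write or repair the change
        feedback = run_ci(patch)                # deterministic: failure list, or None if green
        if feedback is None:
            return open_pull_request(patch)     # every PR is still human-reviewed
    return escalate_to_human(task, patch, feedback)

print(minion_loop("fix flaky checkout test"))
```

If the second attempt still fails, the loop stops burning compute and hands the accumulated context to a person, which is cheaper than letting the model thrash.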

But here is the part that matters most:

None of this was built for AI.

Stripe's CI pipeline, Sorbet type system, devbox infrastructure, testing culture, and code review process were all built for human engineers years before any LLM existed. Minions inherit an environment that was already engineered for correctness. The AI is a new tenant in a building that was already up to code.

Minions work best on well-defined tasks: config changes, dependency upgrades, minor refactoring, flaky test fixes. The codebase is hundreds of millions of lines, mostly Ruby with Sorbet typing — a combination rare enough in public code that models have seen little of it in training — and full of homegrown libraries they have seen none of. The types and the tests compensate for the model's ignorance.

Intercom: The Custom Model That Beat the Frontier

Intercom's Fin Apex 1.0 is a customer service AI built on an undisclosed open-weights base, post-trained on billions of proprietary customer interactions. It resolves 73.1% of conversations without human intervention — compared to 71.1% for GPT-5.4 and Claude Opus 4.5, and 69.6% for Claude Sonnet 4.6.

It also responds in 3.7 seconds (0.6s faster than its nearest rival), hallucinates 65% less, and runs at roughly one-fifth the cost of direct frontier models. One gaming customer saw resolution rates jump from 68% to 75% overnight — unresolved conversations fell from 32% to 25% of the total, a 22% relative reduction and the largest single improvement Intercom has ever recorded.

CEO Eoghan McCabe: "Pre-training is kind of a commodity now. The frontier, if you will, is actually in post-training."

Intercom's Fin agent handles over 2 million conversations per week and is approaching $100 million in annual recurring revenue at 3.5x growth. The 60-person AI team that built it grew from 6 researchers over three years. The model is available only through Fin — $0.99 per resolved interaction, no standalone API.

This is not a coding agent. But the lesson is universal: domain expertise + proprietary data + post-training > frontier general model. Intercom didn't buy the best model. They built the best system.
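
Intercom hasn't published its evaluation harness, but the "domain-specific evals" ingredient is simple in shape: score candidates on your own held-out conversations against your own definition of resolved, instead of trusting a general benchmark. Everything below, from the dataset shape to the model behavior, is invented for illustration.

```python
def resolution_rate(model, conversations) -> float:
    """Share of held-out conversations the model resolves without a human,
    judged by the team's own domain-specific criteria."""
    resolved = sum(1 for convo in conversations if model(convo)["resolved"])
    return resolved / len(conversations)

# Invented stand-ins: an in-house post-trained model vs. a frontier API.
def post_trained_model(convo):  return {"resolved": convo["difficulty"] < 0.8}
def frontier_model(convo):      return {"resolved": convo["difficulty"] < 0.7}

holdout = [{"difficulty": d / 10} for d in range(10)]   # placeholder for real transcripts

for name, model in [("post-trained", post_trained_model), ("frontier", frontier_model)]:
    print(f"{name}: {resolution_rate(model, holdout):.0%} resolved")
```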

The Quantitative Evidence: 135,000 Developers

DX's Q1 2026 study of 135,000+ developers across 4.2 million delivers the numbers. 91% of developers have adopted AI tools. 26.9% of production code is now AI-authored. Daily users see 60% more PR throughput (2.3 vs 1.4 PRs/week) and save 3.6 hours per week. Staff+ engineers save 4.4 hours — more than juniors.

Then the data splits.

Companies with structured AI enablement: a 50% drop in customer-facing incidents.
Companies without structured enablement: a 2x increase in customer-facing incidents.

Same tools. Same models. Same adoption rates. The difference is entirely organizational: companies with structured enablement programs see 8% better code maintainability and 19% less developer time loss. The ones without structured enablement see the opposite.

DX also found something that should reframe the entire conversation: "Meetings, interruptions, review delays, and CI wait times cost developers more time than AI saves." The bottleneck was never the code generation. It was everything around it.

The Pattern

Faros AI studied 10,000 developers across 1,255 teams. High AI adoption correlated with 21% more tasks completed and 98% more PRs — but also 91% more review time. It's Amdahl's Law applied to software delivery: speeding up one stage only helps in proportion to that stage's share of the whole. If code generation is 30% of cycle time, making it infinitely fast caps the overall speedup at roughly 1.4x, and everything that isn't code generation still sets the pace.

DORA's 2025 report made the structural argument explicit:

"The success of AI depends less on the sophistication of tools and more on the organizational systems surrounding them."

— DORA 2025 State of DevOps Report

DORA identified seven capabilities that determine whether AI amplifies strength or exposes weakness: clear AI strategy and governance, healthy data ecosystem, platform engineering (the critical enabler), user-centric development, security collaboration, training programs, and communities of practice. They also introduced a fifth metric — rework rate — and found that AI acts as a multiplier of existing conditions. Good teams get better. Fragile teams break.

McKinsey's 2025 data puts a dollar figure on it: companies that redesigned workflows before selecting AI tools were 2x more likely to report significant returns. Only 6% of all firms report 5%+ EBIT improvement from AI. The 6% didn't have better tools. They had better foundations.

The pattern is now visible across every success story:

Company | AI Result | What Was Already There
Stripe | 1,300 PRs/week, zero human-written | Sorbet types, comprehensive CI, devbox infrastructure, testing culture, code review process — all built for humans
Intercom | 73.1% resolution, beats GPT-5.4 | Billions of proprietary interactions, 60-person AI team grown from 6, domain-specific evals, outcome-based pricing
DX Top Quartile | 50% fewer incidents | Structured enablement programs, maintainability standards, developer experience investment
McKinsey 6% | 5%+ EBIT improvement | Workflow redesign completed before tool selection
DORA High Performers | AI amplifies speed + stability | Platform engineering, governance, training programs, communities of practice — all seven capabilities

The right column is the story. Every company that succeeded with AI coding had already built the infrastructure that makes any tool work — type systems, CI pipelines, review processes, testing cultures, platform engineering, governance, enablement programs. The AI didn't build the foundation. The AI stood on it.

The Uncomfortable Implication

This is not a different story from the one I've been telling for four weeks. It's the same story from the other end.

The companies failing at AI coding aren't failing because of the models. They're failing because they never built the infrastructure that makes any tool work — AI or otherwise. The verification gap exists because there was no verification layer before AI arrived. The reliability tax is high because there's no existing reliability infrastructure to absorb it. The productivity paradox — 89% of firms seeing zero impact — exists because 89% of firms were already running on weak foundations.

AI is a mirror. Stripe looks into it and sees 1,300 PRs a week. Amazon looks into it and sees increased outages. The reflection is honest. The mirror isn't the problem.

This is what DORA means by "AI amplifies existing conditions." It is what Stripe's Minions prove by running on infrastructure built for humans. It is what Intercom proves by beating frontier models with domain data. It is what DX proves with the 50%-vs-2x divergence. The positive case and the negative case are the same case.

What the Positive Case Actually Demands

If you want AI coding to work, the prescription isn't "buy a better model." The top five models on SWE-bench Verified are within 0.9% of each other. The prescription is:

1. Build the verification layer first. CI that catches real problems, type systems that constrain the model's output space, tests that run automatically. Stripe's Sorbet types mean Minions can't ship type-incorrect code. That constraint is worth more than any model upgrade. (A sketch of what this layer looks like follows this list.)

2. Invest in platform engineering. DORA identified it as the critical enabler. Stripe's devboxes, Toolshed MCP server, and rule file synchronization are platform engineering. The agent needs an environment that's already well-organized.

3. Redesign workflows before selecting tools. McKinsey's data is unambiguous: a 2x ROI difference. Choose the process, then choose the tool. Not the reverse.

4. Scope AI to well-defined tasks. Stripe's Minions excel at config changes, dependency upgrades, refactoring, and flaky test fixes — not greenfield architecture. The sweet spot is tasks with clear specs, testable outputs, and bounded scope.

5. Measure outcomes, not adoption. DX's divergent results come from the same adoption rate. NBER found 89% of CEOs see no productivity impact despite 69% adoption. The metric that matters isn't how many developers use AI — it's what the pipeline produces.
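
To make the first item concrete, here is a minimal sketch of a verification layer: nothing exotic, just machine-checkable gates that run on every change regardless of who or what wrote it. The specific commands (mypy, ruff, pytest) and the src/ layout are placeholder assumptions for a Python repository, not any particular company's setup.

```python
import subprocess

# Each gate is a command whose exit code is the verdict; the model's output
# does not get to negotiate with any of them. Swap in whatever your stack uses.
GATES = [
    ("types", ["mypy", "src/"]),           # the type system constrains the output space
    ("lint",  ["ruff", "check", "src/"]),  # style and obvious bug patterns
    ("tests", ["pytest", "-q"]),           # behavior, run automatically on every change
]

def verify(change_description: str) -> bool:
    """Run every gate; a change ships only if all of them pass."""
    for name, cmd in GATES:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"[{name}] blocked '{change_description}'\n{result.stdout}{result.stderr}")
            return False
    print(f"all gates green for '{change_description}'")
    return True
```

The tools matter less than the property: every gate is deterministic, automatic, and indifferent to whether a human or a model produced the change.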

None of this is glamorous. There is no shortcut in this list, no silver bullet, no tool that solves the problem for you. That's the point. The companies that got AI coding right didn't find a better tool. They were already the kind of company where tools work.

The question was never "which tool." It was always "what did you build before the tool arrived?"