The Perception Gap

In July 2025, METR published a study that should have ended the conversation. Sixteen experienced open-source developers completed 246 tasks in codebases where they had an average of five years of prior experience. With AI tools, they were 19% slower.

That was the data. Here was the perception: before the experiment, developers predicted AI would speed them up by 24%. After living through the slowdown, with tasks taking 19% longer on average, they still reported that AI had sped them up by 20%.

A 39-percentage-point gap between what happened and what developers believed happened.
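The arithmetic behind that figure is easy to check. A minimal sketch using the two METR numbers quoted above; the function and variable names are illustrative and do not come from the study, and writing "19% slower" as a speedup of -19 is a convention assumed here:

```python
def perception_gap(perceived_pct: float, measured_pct: float) -> float:
    """Gap, in percentage points, between perceived and measured change in speed."""
    return perceived_pct - measured_pct


perceived_speedup = 20.0   # post-study self-report: 20% faster
measured_speedup = -19.0   # measured result: 19% slower, written as a negative speedup

gap = perception_gap(perceived_speedup, measured_speedup)
print(f"Perception gap: {gap:.0f} percentage points")  # prints "Perception gap: 39 percentage points"
```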

That was one study. Now there are five.

The Evidence Chain

Between July 2025 and April 2026, five independent research groups measured the same phenomenon from different angles, using different methods, across different populations. None of them set out to find a perception gap. All of them did.

Study | Sample | What developers said | What the data showed
------|--------|----------------------|---------------------
METR | 16 devs, 246 tasks | 20% faster | 19% slower
JetBrains HAX | 800 devs, 151M events | 80%+ reported productivity gains | More deletion, more context switching, no debugging change
BNY Mellon | 2,989 devs | 86% satisfied | 60% saving <1 hr/week; r = 0.34
Anthropic RCT | 52 engineers | Task completion faster | 17% lower comprehension scores
Faros | 22,000 devs | 21% more tasks completed | PR review time +91%; org delivery flat

Five studies. Five different methodologies: randomized controlled trials, telemetry at scale, survey triangulation, longitudinal mixed-methods, production engineering metrics. Roughly 26,000 developers across all of them. Every one found the same pattern: developers report gains that the measured data does not support.
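The headcount can be tallied directly from the table. A rough sketch, assuming the five samples do not overlap (the studies do not say either way); the dictionary is just the sample column restated:

```python
# Developers per study, as quoted in the table above.
sample_sizes = {
    "METR": 16,
    "JetBrains HAX": 800,
    "BNY Mellon": 2_989,
    "Anthropic RCT": 52,
    "Faros": 22_000,
}

total = sum(sample_sizes.values())
largest_study, largest_n = max(sample_sizes.items(), key=lambda item: item[1])
share = largest_n / total

print(f"Total developers: {total:,}")                  # prints "Total developers: 25,857"
print(f"{largest_study} share of total: {share:.0%}")  # prints "Faros share of total: 85%"
```

The total rounds to 26,000, though most of that headcount comes from a single study.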

This isn't a sampling artifact. It isn't a quirk of one methodology. It's a structural feature of how humans experience AI assistance.

What 151 Million Events Actually Show

The JetBrains HAX study is the most methodologically rigorous of the five, and the most unsettling. Eight hundred developers, split evenly between AI users and non-users, tracked over two years through 151 million logged IDE events — every keystroke, every deletion, every undo, every window switch, every debugging session. Then surveyed. Then interviewed.

The behavioral data told one story. The developers told another.

AI users typed approximately 600 more characters per month than non-users. They also deleted more. They undid more. A referenced study found that roughly 20% of initially accepted AI-generated code was later deleted, and about 7% was heavily rewritten. The pattern is not "write more, ship more." It's "generate more, discard more."

Context switching — the thing every developer complains about, the thing AI tools promise to reduce — went up, not down. AI users showed increased IDE window activations, more frequent switching between tools and tasks. The perception? Expected reduction. The data? The opposite.

Debugging behavior didn't change at all. Not more, not less. The same sessions, the same patterns, the same time spent. If AI were genuinely improving code quality, you'd expect fewer debugging sessions. If it were degrading quality, more. Neither happened. The debugging load simply persisted regardless of AI use.

And yet: over 80% of these same developers reported that their productivity had increased. Fifty percent said their coding time had decreased. Only two out of sixty-two survey respondents reported any decrease at all.

The researchers' conclusion deserves quoting directly:

"AI redistributes and reshapes developers' workflows in ways that often elude their own perceptions."
— JetBrains HAX, ICSE 2026

r = 0.34

At BNY Mellon, researchers used the DX survey framework — the same framework used by DORA and SPACE — to measure both satisfaction and time savings across 2,989 engineers. The result was a Pearson correlation of r = 0.34 between satisfaction with GitHub Copilot and reported time savings.

For non-statisticians: r = 0.34 means satisfaction explains about 12% of the variance in time savings. The other 88% is something else entirely. Developers aren't satisfied because the tool saves them time. They're satisfied and the tool sometimes saves them time, but the two things are mostly unrelated.
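The 12% figure is the square of the correlation coefficient. A minimal sketch of that calculation, assuming the usual reading of r squared as the share of variance explained in a simple linear relationship; the function name is illustrative:

```python
def variance_explained(r: float) -> float:
    """Share of variance explained by a linear relationship with Pearson correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a Pearson correlation must lie between -1 and 1")
    return r ** 2


r = 0.34  # correlation reported between satisfaction and time savings
explained = variance_explained(r)
unexplained = 1.0 - explained

print(f"r^2 = {explained:.4f}")            # prints "r^2 = 0.1156"
print(f"explained: {explained:.0%}")       # prints "explained: 12%"
print(f"unexplained: {unexplained:.0%}")   # prints "unexplained: 88%"
```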

The distribution makes this concrete. Approximately 400 developers reported being "very satisfied" despite saving only 30 minutes per week. Over 100 developers saving 2+ hours per week were neutral or dissatisfied. Satisfaction and time savings are only weakly correlated: they move in the same direction, but knowing one tells you little about the other.

When the researchers conducted follow-up interviews, the mechanism surfaced. They identified six distinct factors that developers use to evaluate AI productivity, and only one of them (task completion rate) maps to anything an organization would recognize as "output." The six factors:

Self-sufficiency: Reduced need to ask colleagues. "I never visit Stack Overflow now." This feels like productivity. It might just be isolation.

Reduced frustration: Less time stuck. Less cognitive friction. But non-deterministic AI outputs create their own friction: developers reported asking "four or five times to get the correct answer, even if you ask the same thing."

Task completion rate: The only factor that actually measures output.

Peer review ease: Easier to write review-ready code. But Faros found PR review time increased 91% at organizations with high AI adoption. The reviewers disagree.

Technical expertise: Long-term risk. Recognized by developers themselves: "if the code just works, then you just accept it."

Ownership of work: The felt sense of authorship. "Nothing like doing it yourself." Developers value ownership, and recognize they're losing it.

Satisfaction measures feelings about the experience. Productivity measures output per unit time. When organizations survey developers about AI tools, they are measuring the first and reporting the second.

The Invisible Skill Tax

A randomized controlled trial run by Anthropic, the company that builds Claude, found the deepest cut. Fifty-two junior engineers learning a new library. The AI-assisted group completed tasks. They also scored 17% lower on comprehension tests afterward.

But the split within that group was more revealing. Engineers who used AI to ask conceptual questions — "why does this work this way?" — scored 65% or higher. Engineers who delegated code generation to AI — "write this for me" — scored below 40%.

The tool didn't cause the skill degradation. The mode of use did. And the dominant mode — the one that feels most productive, the one that produces code fastest, the one that generates satisfaction — is the one that degrades comprehension.

Anthropic noted that their study used a chat interface, not an agentic coding tool like Claude Code. Their expectation: "the impacts of such programs on skill development are likely to be more pronounced than the results here."

The company that sells the tool published the warning. The market continued buying.

The Feedback Loop That Doesn't Correct

Normal feedback loops self-correct. You feel warm but the thermometer reads cold — you trust the thermometer. You feel fast but the clock reads slow — you trust the clock.

AI coding tools break this loop in three places.

First, the individual level. METR's 39-point perception gap is not ignorance. It's that the subjective experience of using AI genuinely feels faster. The tool removes the dead moments — the blank cursor, the uncertain pause, the search for syntax. Those moments feel like waste. They may actually be thinking. When AI fills them, the developer experiences uninterrupted flow. Flow feels fast. It is not the same as fast.

Second, the organizational level. BNY Mellon surveyed 2,989 engineers and found 86% satisfaction. That number goes into a slide deck. An executive reads "86% of engineers satisfied with AI tools" and hears "AI is working." The r = 0.34 correlation, which shows that satisfaction is only weakly tied to time savings, does not make it into the slide deck. The number that gets measured is the number that gets managed.

Third, the market level. Faros tracked 22,000 developers and found 21% more tasks completed with AI — the headline number. PR review time up 91% — the detail buried in the methodology. Organizational delivery flat — the conclusion nobody wants to fund. At 93% developer adoption and 51% of committed code AI-generated, the industry has decided the question is settled. The data says it isn't.

METR's 2026 follow-up illustrates the problem. The original cohort — the same developers who were 19% slower — showed -18% in the follow-up (CI: -38% to +9%). No statistically significant improvement. But newly recruited developers showed -4% (CI: -15% to +9%). METR concluded that developers are "likely more sped up from AI tools now" — a reasonable inference. But they also noted severe selection effects: the developers willing to participate in a study that previously showed AI making people slower are probably different from those who aren't. The measurement itself is contaminated by belief.
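The "no statistically significant" reading follows from both intervals straddling zero. A small sketch of that check, assuming the quoted ranges are conventional confidence intervals (the text does not state the level); the class and function names are hypothetical, and the test works the same whichever sign convention the figures use:

```python
from typing import NamedTuple


class Estimate(NamedTuple):
    cohort: str
    point: float    # point estimate in percent, as quoted
    ci_low: float   # lower bound of the interval, in percent
    ci_high: float  # upper bound of the interval, in percent


def interval_excludes_zero(est: Estimate) -> bool:
    """True when the whole interval lies on one side of zero."""
    return est.ci_low > 0 or est.ci_high < 0


estimates = [
    Estimate("original cohort", -18.0, -38.0, 9.0),
    Estimate("newly recruited", -4.0, -15.0, 9.0),
]

for est in estimates:
    verdict = "excludes zero" if interval_excludes_zero(est) else "includes zero"
    print(f"{est.cohort}: {est.point:+.0f}% [{est.ci_low:+.0f}%, {est.ci_high:+.0f}%] {verdict}")
```

Both cohorts come out as "includes zero", so neither estimate can be distinguished from no effect on these numbers alone.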

Why It Persists

The perception gap is not a transitional artifact. It is not the kind of confusion that resolves with better tools or more experience. It persists because it serves every actor in the system.

Developers prefer it because the alternative is psychologically costly: accepting that the tool they use eight hours a day, the one that defines their workflow and anchors their professional identity, doesn't actually make them faster. The BNY Mellon interviews surfaced this directly: developers recognized the skill erosion risk, named it, and kept using the tool. They chose satisfaction over sovereignty.

Organizations prefer it because the alternative — that the $12.8 billion they're spending on AI coding tools generates single-digit-percent productivity gains at best, and possibly negative returns — threatens procurement decisions already made. It is easier to collect satisfaction surveys than to instrument the engineering pipeline for ground-truth measurement.

Vendors prefer it because their business model depends on it. AI coding tools are sold on developer experience — satisfaction, NPS, "most loved" rankings. Claude Code's 91% CSAT and 54 NPS are genuine competitive advantages. They are also perception metrics measuring perception. The product is the feeling.

86% satisfied. 60% saving less than an hour a week.

80% report productivity gains. Deletion and undo rates up.

20% perceived speedup. 19% measured slowdown.

Five studies. Same gap. No one's incentive to close it.

This is not the Solow paradox restated. Solow's version — "you can see the computer age everywhere except in the productivity statistics" — was about a lag between deployment and organizational adaptation. It resolved. The J-curve bent upward. Firms that restructured around IT eventually captured the gains.

The perception gap is different. Solow's computers didn't make workers believe they were more productive when they weren't. The gap was between investment and outcome. This gap is between experience and outcome — and it's the experience that drives the investment. The very mechanism by which organizations decide whether AI is working (asking developers how they feel) is the mechanism that prevents them from discovering it isn't.

You can feel the AI everywhere. Even in the data that says it isn't there.