The Dashboard

Faros AI published their 2026 AI Engineering Impact Report in March. Two years of telemetry. 22,000 developers. 4,000 teams. The most comprehensive dataset anyone has assembled on what actually happens when engineering organizations adopt AI at scale.

The top-line numbers were excellent.

Epics completed per developer: up 66%. Task throughput per developer: up 33.7%. PR merge rate: up 16.2%. The kind of numbers that make a quarterly review sing. The kind of numbers that appear on the slide deck the CTO shows the board.

Same dataset. Same 22,000 developers. Same two years. Here are both sets of numbers, side by side.

What the dashboard shows	Change
Epics completed per developer	+66%
Task throughput per developer	+33.7%
PR merge rate per developer	+16.2%

What the code shows	Change
Code churn	+861%
Bugs per developer	+54%
Incidents-to-PR ratio	+242.7%
Monthly incidents	+57.9%
Median time in review	+441.5%
PRs merged without review	+31.3%

The same organizations showing +66% epic completion also showed +861% code churn: the ratio of code deleted to code added rose to roughly 9.6 times its earlier level. The same teams merging 16% more PRs were also merging without review 31% more often and seeing the incidents-to-PR ratio rise 243%.
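Percent changes of very different sizes are easier to compare as multipliers. The sketch below only restates the report's figures quoted above; the helper name `multiplier` is an illustration, not something from the Faros report.

```python
# Percent changes as quoted in this article; no other data is assumed.
changes = {
    "Epics completed per developer": 66.0,
    "Task throughput per developer": 33.7,
    "PR merge rate per developer": 16.2,
    "Code churn": 861.0,
    "Bugs per developer": 54.0,
    "Incidents-to-PR ratio": 242.7,
    "Monthly incidents": 57.9,
    "Median time in review": 441.5,
    "PRs merged without review": 31.3,
}


def multiplier(percent_change: float) -> float:
    """A change of +p% means the new value is (1 + p/100) times the old one."""
    return 1 + percent_change / 100


for name, pct in changes.items():
    print(f"{name}: +{pct}% is {multiplier(pct):.2f}x the earlier level")
```

Read this way, the largest velocity gain is a 1.66x multiple, while code churn sits at about 9.6x and median time in review at about 5.4x.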

Both sets of numbers are accurate. Both come from the same telemetry. They describe the same organizations over the same period. The question is which set reaches the people making decisions.

The Pipeline Inverts

CircleCI's 2026 State of Software Delivery report — 28 million workflows, the largest CI/CD dataset ever assembled — found the same pattern at the pipeline level. Daily workflow runs increased 59% year over year, the biggest throughput spike they've ever measured. But when they separated feature branches from main:

Feature branch throughput for the median team: up 15%. Main branch throughput for the median team: down 7%.

More code is entering the pipeline. Less code is exiting it. Main branch success rate fell to 70.8%, the lowest in five years, against a benchmark of 90%. Recovery time climbed to 72 minutes, up 13%. And the gains were wildly concentrated: the top 5% of teams increased throughput 97%. The median team got 4% overall. The bottom quartile got nothing.

The pipeline is expanding at the input and contracting at the output. But the throughput number on the dashboard counts every run, so growth at the input masks the contraction at the output.
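A minimal sketch of that split, assuming workflow runs are available as simple (branch, succeeded) records. The record shape, the sample data, and the function names are hypothetical and are not CircleCI's schema or API.

```python
from collections import Counter

# Hypothetical workflow-run records: (branch name, whether the run succeeded).
runs = [
    ("feature/login", True),
    ("feature/login", False),
    ("feature/search", True),
    ("feature/search", True),
    ("main", True),
    ("main", False),
]


def split_throughput(runs, main_branch="main"):
    """Count runs on the main branch separately from runs on every other branch."""
    counts = Counter(
        "main" if branch == main_branch else "feature" for branch, _ in runs
    )
    return counts["main"], counts["feature"]


def success_rate(runs, branch_name):
    """Share of successful runs on one branch, or None if it has no runs."""
    outcomes = [ok for branch, ok in runs if branch == branch_name]
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


main_runs, feature_runs = split_throughput(runs)
print("total runs:", len(runs))
print("feature-branch runs:", feature_runs)
print("main-branch runs:", main_runs)
print("main success rate:", success_rate(runs, "main"))
```

The total alone would show six runs and nothing else. Only the split shows that four of them never reached main and that half of the main-branch runs failed.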

Where the Confidence Comes From

CloudBees surveyed 213 enterprise technology leaders in May 2026 and found that 92% expressed confidence their AI-generated code was production-ready. Then 81% reported production failures from that code.
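The two percentages imply a minimum overlap. The sketch below assumes both figures are shares of the same 213 respondents, which the survey summary quoted here does not state explicitly; the function name is illustrative.

```python
def min_overlap(pct_a: float, pct_b: float) -> float:
    """Smallest possible share belonging to both groups (inclusion-exclusion).

    If pct_a% hold one property and pct_b% hold another, at most
    (100 - pct_a)% lack the first, so at least pct_a + pct_b - 100
    percent must hold both.
    """
    return max(0.0, pct_a + pct_b - 100.0)


confident = 92.0  # % confident their AI-generated code was production-ready
failures = 81.0   # % reporting production failures from that code

print(min_overlap(confident, failures))  # 73.0
```

Under that assumption, at least 73% of respondents were both confident and reporting failures.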

This is not ignorance. These are senior technology leaders who understand software. The confidence is rational given what they're seeing — and what they're seeing is the dashboard. Epics are up. PRs are merging. Velocity is climbing. The numbers they track are genuinely improving.

The numbers they don't track are the ones collapsing. Only 56% of these organizations always enforce formal code review. Only 12% have a dedicated AI governance function. Only 31% can attribute AI spending to specific business outcomes. 36% don't track AI spending at all.

The velocity metrics are automated. They come from Git, from the CI/CD pipeline, from the project management tool. They arrive without anyone asking for them. The quality metrics — incident attribution, production failure rates, code churn ratios, review depth — require human effort to collect, human judgment to interpret, and organizational will to route to the right people. The fast metrics are always available. The slow metrics arrive late, on different screens, to different teams.
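One way to picture the asymmetry is a report that puts both kinds of metric in the same record and marks the missing ones explicitly instead of omitting them. This is a sketch under stated assumptions: the field names, the `TeamMetrics` type, and the `UNAVAILABLE` marker are all hypothetical, not taken from any of the reports cited here.

```python
from dataclasses import dataclass
from typing import Optional, Union

UNAVAILABLE = "unavailable"  # hypothetical marker for a missing or undefined metric


@dataclass
class TeamMetrics:
    # Velocity fields: assumed to arrive automatically from Git and CI tooling.
    prs_merged: int
    prs_merged_without_review: int
    # Quality fields: None means nobody has collected the number yet.
    incidents_attributed: Optional[int] = None
    lines_added: Optional[int] = None
    lines_deleted: Optional[int] = None


def ratio(num: Optional[int], den: Optional[int]) -> Union[float, str]:
    """Return num/den, or the UNAVAILABLE marker if an input is missing or den is 0."""
    if num is None or den is None or den == 0:
        return UNAVAILABLE
    return round(num / den, 3)


def paired_report(m: TeamMetrics) -> dict:
    """Place velocity and quality numbers side by side, with gaps made visible."""
    return {
        "prs_merged": m.prs_merged,
        "unreviewed_merge_share": ratio(m.prs_merged_without_review, m.prs_merged),
        "incidents_per_pr": ratio(m.incidents_attributed, m.prs_merged),
        "deleted_to_added_ratio": ratio(m.lines_deleted, m.lines_added),
    }


# A team whose velocity data exists but whose quality data was never collected.
team = TeamMetrics(prs_merged=120, prs_merged_without_review=30)
for key, value in paired_report(team).items():
    print(f"{key}: {value}")
```

The design choice is that a missing quality number shows up as a visible gap on the same screen as the velocity number, rather than as an absence nobody notices.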

Where the Distrust Lives

The developers know. Sonar's 2026 survey: 96% don't fully trust AI-generated code. Stack Overflow: trust in AI tools dropped from 40% to 29% in a single year, even as usage rose to 84%. More developers actively distrust AI accuracy (46%) than trust it (33%).

These developers are inside the same organizations where 92% of leaders are confident. The distrust is correct — 81% of those organizations had production failures. But the distrust doesn't propagate upward. The mechanism is metric selection: the signals that reach leadership are the automated velocity numbers where AI looks good. The signals that would correct leadership's confidence live in incident reports, code review comments, and the professional judgment of the people closest to the code — channels that are slower, noisier, and easier to dismiss as anecdotal.

"AI generates code faster than teams can validate it." — Jacob Krell, Suzu Labs, quoted in The Register

This has been said before. What hasn't been said is that the organizational response to this fact is to measure the generation speed and not the validation failure rate. The dashboard creates its own reality: if you track velocity, velocity improves. If you don't track quality, quality is invisible until it hits production — at which point the incident gets attributed to the deployment, the configuration, the infrastructure, anything but the metric architecture that routed the wrong signal to the top.

The Structural Problem

Velocity is easy to measure because generation is easy to automate. Commits, PRs, story points completed — these are events in a system that logs events. Quality is hard to measure because verification is hard to automate. Whether a piece of code is correct, secure, maintainable, and aligned with the system's actual needs — these are judgments that require context, experience, and attention. The very things that AI review is structurally poor at providing.

The result is a measurement bias baked into the infrastructure. Organizations didn't choose to watch velocity and ignore quality. The tools they already had — Git analytics, CI/CD dashboards, project management platforms — naturally measure what AI accelerates. The tools they'd need to measure what AI degrades — production quality attribution, code churn analysis, review depth tracking, cognitive load assessment — either don't exist, aren't integrated, or require the kind of dedicated governance that only 12% of organizations have built.

The blind spot isn't in the leadership. It's in the metric stack. The leaders are reading the data that arrives. The data that would change their minds doesn't.