I Was Wrong About Models Being Commodities. Here's What the Clean Data Shows.

SIGNAL NOTE — A short-form correction. When I get something wrong, I say so.

I've written some version of "the model is a commodity" in at least six articles. The evidence seemed bulletproof: the top five models on SWE-bench Verified were within 0.9 percentage points of each other. Opus 4.5 at 80.9%, GPT-5.2 at 80.0%. A rounding error.

Then OpenAI retired SWE-bench Verified on February 23, 2026. The reason: contamination. The 500 Python-only tasks had leaked into training data. The benchmark that proved models were commodities was compromised.

What the Clean Benchmark Shows

SWE-bench Pro replaced it — 1,865 tasks across 41 repositories, multi-language, with licensing designed to prevent contamination. Scale AI runs a standardized evaluation (SEAL) where every model gets the same scaffolding (SWE-Agent, 250 turns). Here's what it shows:

| Model | SEAL (Standardized) | Best Agent System |
|---|---|---|
| Claude Opus 4.5 | 45.9% ±3.6 | 55.4% (Claude Code) |
| Claude Sonnet 4.5 | 43.6% ±3.6 | |
| Gemini 3 Pro | 43.3% ±3.6 | |
| GPT-5 (High) | 41.8% ±3.5 | 57.7% (GPT-5.4) |

The spread across the top four on standardized scaffolding is about 4 percentage points. The confidence intervals overlap for adjacent pairs, and Scale's own analysis groups the top four as statistically tied at rank 1. So models are still close, just not as close as Verified made them look, and the gap is real enough to matter on harder problems.
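To make that concrete, here is a minimal Python sketch of the interval arithmetic, using only the scores and ±CI half-widths from the table. This is not Scale's actual significance test; it just checks whether each trailing model's interval overlaps the leader's.

```python
# SEAL (standardized scaffolding) scores and CI half-widths, from the table above.
seal = [
    ("Claude Opus 4.5",   45.9, 3.6),
    ("Claude Sonnet 4.5", 43.6, 3.6),
    ("Gemini 3 Pro",      43.3, 3.6),
    ("GPT-5 (High)",      41.8, 3.5),
]

scores = [score for _, score, _ in seal]
print(f"top-four spread: {max(scores) - min(scores):.1f} pp")  # 4.1 pp

# Treat two models as statistically indistinguishable here if their intervals overlap.
top_name, top_score, top_ci = seal[0]
for name, score, ci in seal[1:]:
    overlaps = (top_score - top_ci) <= (score + ci)
    print(f"{top_name} vs {name}: intervals overlap -> {overlaps}")
```

Every trailing model's interval overlaps the leader's, which is consistent with Scale's "tied at rank 1" grouping.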

But Look at the Right Column

The same model — Opus 4.5 — scores 45.9% on SEAL, 50.2% inside Cursor, 51.8% inside Augment, and 55.4% inside Claude Code. That's a 10-point spread from scaffolding alone. The best agent system (GPT-5.4 at 57.7%) leads the best SEAL model (Opus 4.5 at 45.9%) by 12 points — and most of that gap is architecture, not model quality.
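The same back-of-the-envelope arithmetic, in code, using only the harness scores quoted above:

```python
# One model (Opus 4.5), four harnesses: the scaffolding spread, from the numbers quoted above.
opus_by_harness = {
    "SEAL (SWE-Agent, standardized)": 45.9,
    "Cursor": 50.2,
    "Augment": 51.8,
    "Claude Code": 55.4,
}

spread = max(opus_by_harness.values()) - min(opus_by_harness.values())
print(f"scaffolding spread for one model: {spread:.1f} pp")  # 9.5 pp

# Best agent system overall vs. the best model on standardized scaffolding.
best_agent_system = 57.7  # GPT-5.4 inside its own agent harness
best_seal_model = 45.9    # Opus 4.5 on SEAL
print(f"best agent vs. best standardized model: {best_agent_system - best_seal_model:.1f} pp")  # 11.8 pp
```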

The correction: "Models are commodities" was right on Verified, but Verified was contaminated. On a clean benchmark, models differ by ~4pp. Scaffolding differs by ~10pp. The claim needs a qualifier: models are near-commodities; scaffolding is not.

Why This Strengthens the Thesis

Here's the part I didn't expect. The self-correction actually reinforces the infrastructure argument I've been making since article #15. If models were truly interchangeable (the Verified picture), betting on scaffolding would be the obvious default, because there would be nothing else left to differentiate on. But if models differ modestly while scaffolding creates a spread more than twice as large, then even where model choice genuinely matters, it matters less than how you deploy the model, and the case for investing in infrastructure over chasing frontier models is stronger, not weaker.

Stripe doesn't ship 1,300 PRs per week because they have access to a special model. They ship because they built devboxes, type systems, and CI pipelines that make any sufficiently capable model productive. The ~4pp model gap matters. The ~10pp scaffolding gap matters more.

One more number. On SWE-bench Pro Private (276 instances from 18 proprietary codebases that no model has seen), scores collapse to ~23% for the best models. From 81% on Verified to 46% on Pro to 23% on Private. The harder and more realistic the benchmark, the more both model choice and scaffolding matter, but at every difficulty level, how you wrap the model matters more than which model you choose.
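Putting the three rounded numbers side by side shows how steep the collapse is: each tier keeps only 50–60% of the previous tier's best score. A tiny sketch, using only the figures above:

```python
# Best-model scores as the benchmark gets harder and less contaminated (rounded figures from above).
best_scores = {
    "SWE-bench Verified (retired, contaminated)": 81.0,
    "SWE-bench Pro (public, SEAL)": 46.0,
    "SWE-bench Pro Private (unseen codebases)": 23.0,
}

prev = None
for benchmark, score in best_scores.items():
    if prev is None:
        print(f"{benchmark}: {score:.0f}%")
    else:
        print(f"{benchmark}: {score:.0f}% ({score / prev:.0%} of the previous tier)")
    prev = score
```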

Sources: Scale SEAL Leaderboard · Morph LLM Analysis · SWE-bench Pro Paper (arXiv 2509.16941) · Augment Code Blog