On April 23, OpenAI shipped GPT-5.5. On April 24, DeepSeek shipped V4. The same day, Cursor shipped 3.2. Three frontier releases in 48 hours, each claiming the crown. The headline writers wanted a winner.
There isn't one.
Three Models, Three Axes
For the first time in this landscape, three different models lead three different competitions — and no model wins all three.
Opus 4.7 resolves issues in the hardest codebases. GPT-5.5 orchestrates the most complex agent workflows. DeepSeek V4 delivers 96% of Opus 4.6's code quality at one-seventh the price, under an MIT license. Three models, three value propositions, zero overlap at the top.
The commodity thesis I've been tracking since March finally has its clearest form: the accessible tier is commoditizing while the frontier is diverging. V4-Pro sits within 0.2 percentage points of Opus 4.6 on SWE-bench Verified, at one-seventh the cost. But at the frontier, each model has carved out territory the others can't match.
The 86% Problem
The most revealing number from this week isn't a benchmark score. It's a failure mode.
On Artificial Analysis's AA-Omniscience benchmark, GPT-5.5 posted the highest accuracy ever recorded: 57%. It also posted the highest hallucination rate: 86%. When GPT-5.5 doesn't know something, it almost never says so. It guesses, in the same confident tone it uses when it's right.
For agentic coding — the exact use case where GPT-5.5 leads — this is the wrong tradeoff. A confident wrong action inside an autonomous agent loop is worse than a model that stops and asks. The orchestration king ships the most plausible errors.
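How can one model post both the highest accuracy and the highest hallucination rate? The arithmetic below is a minimal sketch, assuming (as abstention-aware evaluations like AA-Omniscience broadly do) that hallucination rate counts wrong guesses as a share of the questions the model did not answer correctly, so abstaining is the only way to keep it down. The counts are hypothetical, chosen only to reproduce a 57% / 86% pair; they are not GPT-5.5's actual tallies.

```python
# Illustrative arithmetic only: how a model can post chart-topping accuracy
# and a chart-topping hallucination rate at the same time.
# Assumed definition: hallucination rate = wrong / (wrong + abstained),
# i.e. of the questions the model did NOT get right, how often it guessed
# rather than saying "I don't know". All counts here are hypothetical.

def scores(correct: int, wrong: int, abstained: int) -> tuple[float, float]:
    """Return (accuracy, hallucination_rate) over all questions."""
    total = correct + wrong + abstained
    accuracy = correct / total
    hallucination_rate = wrong / (wrong + abstained)  # guess rate when not correct
    return accuracy, hallucination_rate

# A hypothetical 1,000-question run: lots of correct answers, almost no abstentions.
acc, hall = scores(correct=570, wrong=370, abstained=60)
print(f"accuracy: {acc:.0%}, hallucination rate: {hall:.0%}")  # accuracy: 57%, hallucination rate: 86%
```

Under that definition, a model can be right more often than any rival and still almost never abstain when it is wrong, which is exactly the profile the headline numbers describe.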
Contrast this with DeepSeek V4's self-assessment in its technical report: "trails state-of-the-art frontier models by approximately 3 to 6 months." A model that knows what it doesn't know, priced at one-seventh the cost of models that don't. CodeRabbit's independent testing of GPT-5.5 found the same failure mode: it's "quicker, leaner, more direct" but "followed instructions too literally", with less self-correction than expected. Peter Gostev at Arena.ai measured a 45% pushback rate, unchanged from GPT-5.4. The capability step is real. The reliability step is not.
The Hardware Story That Isn't
The headline version — "DeepSeek trained V4 on Huawei chips, breaking the NVIDIA monopoly" — is wrong. The reality, traced through MIT Technology Review, The Register, and Reuters: V4-Flash was partially trained on Huawei Ascend 950 chips. V4-Pro training was likely still NVIDIA-dependent. Both models can run inference on Ascend hardware. DeepSeek rewrote core code for Huawei's CANN architecture to bypass CUDA.
This is a step, not the break. But the direction is the story. Jensen Huang warned about exactly this scenario: domestic Chinese chips becoming viable enough to support frontier AI inference. V3 had to revert to NVIDIA for training. V4 got further. V5 is the one to watch.
The Infrastructure Cracked
While three flagship releases shipped, the infrastructure around them strained visibly.
Claude Code's April 23 postmortem revealed three bugs that degraded its quality over six weeks. The most poetic: a caching optimization that wiped the AI's thinking context on every turn. The AI literally lost memory of its own reasoning. Users noticed before Anthropic's internal monitoring did. The company building AI verification tools had its own verification gap.
Cursor 3.2 shipped worktrees, multitask, and multi-root workspaces — the orchestration layer materializing in production. But developer forums immediately flagged stability issues: corrupted chat histories, broken file saves, agents hanging when not in focus. The orchestration future arrived with bugs that an agent-first workflow can't afford.
And SEAL — the independent standardized benchmark the industry relies on for apples-to-apples comparison — still hasn't published results for GPT-5.5 or DeepSeek V4 as of April 25. Every evaluation you've read this week is either self-reported by the vendor or run on custom scaffolding. The market is making adoption decisions on unverified numbers.
What the Week Proved
Results from JetBrains' January 2026 developer survey landed the same week: 90% of developers use at least one AI tool. 51% of GitHub commits are AI-generated or assisted. Claude Code is the most loved tool on the market (CSAT 91%, NPS 54). And only 29% trust AI output to be accurate.
That gap — 90% using, 29% trusting — is the frame for understanding this week. The models are genuinely diverging, each specializing along a different axis. The tools are shipping orchestration features faster than they can stabilize them. The evaluation infrastructure can't keep up with the release cadence. And developers are adopting all of it anyway, because the productivity pull is too strong to resist and the trust infrastructure hasn't been built yet.
Everyone shipped. Nobody verified. The week keeps moving.