Two press releases. Same day. Opposite directions.
On April 7, Anthropic launched Project Glasswing: Claude Mythos Preview, the most capable coding model ever built, restricted to 52 organizations for defensive security work. 77.8% on SWE-bench Pro. Nineteen points clear of the next model. Not available for purchase.
The same morning, Z.ai released GLM-5.1: 754 billion parameters, MIT license, weights on HuggingFace, open API on OpenRouter. 58.4% on SWE-bench Pro. The highest score ever posted by a model you can actually use.
The leaderboard didn't just update. It fractured.
The Scoreboard
| MODEL | SWE-BENCH PRO | $/M INPUT | ACCESS |
|---|---|---|---|
| Claude Mythos Preview | 77.8% | $25.00 | 52 ORGS |
| ━━━━━━━━━━ 19.4-POINT GAP ━━━━━━━━━━ | | | |
| GLM-5.1 MIT | 58.4% | $1.40 | OPEN |
| GPT-5.4 | 57.7% | $2.50 | API |
| Qwen 3.6-Plus | 56.6% | $0.33 | API |
| Claude Code (Opus 4.5) | 55.4% | $5.00 | API |
| Auggie (Opus 4.5) | 51.8% | $5.00 | API |
SWE-bench Pro, Agent Systems tier. Custom scaffolding. Sources: BenchLM, Scale SEAL.
Read the Access column. The best model on the board is behind a wall. The second-best model is the most open thing on the list. That's not a coincidence. It's two visions of how AI should work, launching on the same morning.
What GLM-5.1 Actually Is
Z.ai (formerly Zhipu AI, Tsinghua spinoff, Hong Kong IPO in January, ~$31B market cap) built GLM-5.1 as a point release of their GLM-5 base. Same 754B-parameter mixture-of-experts architecture — 256 experts, 8 active per token, ~40B parameters per forward pass. What changed is the post-training: they re-ran their SLIME asynchronous RL pipeline with a training distribution targeted at coding tasks. The result is a 28% coding improvement over GLM-5 without architectural changes.
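If you've never looked inside a mixture-of-experts layer, the 8-of-256 shape is easy to picture in code. Below is an illustrative sketch, not Z.ai's implementation: the hidden sizes are placeholders, and only the 256-expert, 8-active routing matches GLM-5.1's published configuration.

```python
# Illustrative top-k MoE routing. NOT GLM-5.1's actual code; d_model
# and d_ff are placeholder sizes. Only the 256/8 routing shape matches
# Z.ai's published configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=2048, n_experts=256, k=8):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, 256)
        weights, idx = scores.topk(self.k, -1)  # pick 8 experts per token
        weights = F.softmax(weights, dim=-1)    # normalize over the chosen 8
        out = torch.zeros_like(x)
        for t in range(x.size(0)):              # naive loop; production
            for slot in range(self.k):          # kernels batch tokens per expert
                expert = self.experts[idx[t, slot].item()]
                out[t] += weights[t, slot] * expert(x[t])
        return out
```

Only the eight selected experts run for a given token; the other 248 sit idle. That is how a 754B-parameter model does a ~40B forward pass: roughly 8/256 of the expert weights, plus the attention and embedding layers every token shares.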
The hardware line is the one that should make you look twice. GLM-5.1 was trained on 100,000 Huawei Ascend 910B chips. Zero NVIDIA silicon. Z.ai has been on the US Entity List since January 2025 — they had no choice. But the result speaks for itself: a model that beats every commercially available model on SWE-bench Pro, trained entirely on domestic Chinese hardware using Huawei's MindSpore framework.
The eight-hour claim. Z.ai's headline demo: GLM-5.1 built a functional Linux desktop environment (file browser, terminal, text editor, games) from scratch in 655 autonomous iterations. A separate run optimized a vector database from baseline to 21,500 queries per second through 600+ iterations and 6,000+ tool calls — a 6x improvement over what a standard 50-turn session could achieve. These are Z.ai's own demonstrations, reported by VentureBeat, not independently audited.
The Caveat That Matters
I would be failing my own standards if I didn't flag the asterisk. GLM-5.1's 58.4% is on the Agent Systems tier of SWE-bench Pro — meaning Z.ai submitted using their own agent scaffolding, not the standardized SEAL methodology that controls for scaffolding differences.
This matters because scaffolding creates enormous score swings. The same Opus 4.5 model scores 45.9% on SEAL's standardized test but 55.4% when wrapped in Claude Code's agent harness — a 9.5-point gap from scaffolding alone. I covered this in The Moat That Lasted 72 Hours. The gap hasn't gone away.
Z.ai hasn't disclosed what agent system produced their 58.4%. Until they do — or until SEAL runs GLM-5.1 through standardized evaluation — the exact ranking among available models carries an asterisk. The number is real. The ranking is provisional.
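For a feel of what "scaffolding" means in practice, here is the shape of the difference as a hypothetical sketch. `generate_patch` and `apply_and_test` are invented stand-ins, not any real API, and SEAL's standardized setup isn't literally single-shot; the asymmetry is the point.

```python
# Hypothetical sketch of why agent scaffolding moves SWE-bench scores.
# generate_patch() and apply_and_test() are invented stand-ins, not a
# real model or eval API.

def bare(model, issue):
    """One attempt, no feedback: the floor a model sets on its own."""
    return model.generate_patch(issue)

def harnessed(model, issue, repo, max_turns=50):
    """Retry loop: each failing test run becomes context for the next try."""
    context, patch = issue, None
    for _ in range(max_turns):
        patch = model.generate_patch(context)
        result = repo.apply_and_test(patch)   # apply the patch, run the tests
        if result.passed:
            return patch                      # solved; score one for the harness
        context = f"{issue}\n\nTest output:\n{result.log}"
    return patch                              # best effort after max_turns
```

Same model both times. The harness converts failing test output into another attempt, and fifty attempts with feedback beat one attempt without. That asymmetry is where the 9.5 points live, and it's why an undisclosed harness makes a 58.4% hard to rank.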
And on the broader coding composite (Terminal-Bench 2.0 plus NL2Repo), Claude Opus 4.6 still leads at 57.5 versus GLM-5.1's 54.9. The "beats Claude" headline is accurate on SWE-bench Pro specifically, not universally.
The Price Column
Even with the scaffolding caveat, look at the pricing table again. GLM-5.1 costs $1.40 per million input tokens. Claude Opus 4.5 costs $5.00. GPT-5.4 costs $2.50. Qwen 3.6-Plus costs $0.33.
The two models from Chinese labs — both open-weight, both available today — cost a fraction of the proprietary alternatives while posting competitive or superior benchmark scores. The pricing spread between the top and bottom of the publicly available tier is now 15x.
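To make that spread concrete, run the table's input prices through a hypothetical workload. Two billion input tokens a month is an assumption, not a benchmark, and output-token pricing isn't in the table, so real bills run higher.

```python
# Monthly input-token cost at an assumed 2B tokens/month, using the
# $/M-input prices from the table above. Output pricing (not listed)
# would raise every figure.
PRICES = {
    "Claude Mythos Preview": 25.00,  # restricted tier, shown for scale
    "Claude Opus 4.5":        5.00,
    "GPT-5.4":                2.50,
    "GLM-5.1":                1.40,
    "Qwen 3.6-Plus":          0.33,
}
TOKENS_IN_MILLIONS = 2_000  # 2B input tokens per month

for model, per_million in sorted(PRICES.items(), key=lambda kv: -kv[1]):
    print(f"{model:22} ${per_million * TOKENS_IN_MILLIONS:>9,.0f}/mo")
```

At that volume: $50,000 a month for Mythos if you could buy it, $10,000 for Opus 4.5, $5,000 for GPT-5.4, $2,800 for GLM-5.1, $660 for Qwen. The 15x spread stops being abstract.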
What Split
Here is what happened to the AI coding market on April 7, 2026:
Tier 1: The restricted frontier. One model. 77.8% SWE-bench Pro. Available to 52 organizations, exclusively for security work. $25 per million input tokens. You can't buy it. You can't rent it. You have to be invited, and even then you can only use it to find vulnerabilities, not write production code.
Tier 2: Everything else. A spread of under seven points, from 51.8% to 58.4%. Five entries spanning four models: two open-weight, two proprietary. Pricing ranges from $0.33 to $5.00 per million input. And the model at the top of this tier — the one that edges out GPT-5.4 and Claude Code — is MIT-licensed and trained on Huawei chips.
This is the market now. The commodity thesis I've been tracking since article #32 — corrected in #33, extended in #36 and #38, apparently killed at the frontier in #39 — has reached its final form. At the restricted frontier, capability divergence is massive and growing. In the tier you can actually access, convergence is nearly complete, open-source is leading, and the differentiator is scaffolding quality and cost.
Both things are true at the same time. They describe different markets serving different purposes. Mythos exists to find zero-days in OpenBSD. GLM-5.1 exists to ship your feature branch. They will never compete because they were never in the same market.
The question for every developer and engineering leader is simple: which tier are you in? For 99.9% of you, the answer is Tier 2. And in Tier 2, the best model is open-source, costs $1.40 per million tokens, and was trained without a single NVIDIA chip.
Sources: Anthropic Project Glasswing · Z.ai GLM-5.1 docs · BenchLM SWE-bench Pro · Scale SEAL leaderboard · VentureBeat · MarkTechPost · Simon Willison · OpenRouter · SLIME GitHub · CNBC (Z.ai IPO)