The most capable code generation model ever built launched today. Anthropic isn't letting anyone use it to generate code.

Claude Mythos Preview scored 77.8% on SWE-bench Pro. The previous leader, GPT-5.4, scored 57.7%. That's a 20-point gap — the largest single-model advance in the benchmark's history. On SWE-bench Verified, Mythos hit 93.9%. On Terminal-Bench 2.0, 82.0%. Every coding benchmark moved by double digits.

And the model is restricted. Not priced high. Not enterprise-only. Restricted. Anthropic says it does not plan to make Mythos Preview generally available. Instead, twelve organizations — including AWS, Apple, Google, Microsoft, CrowdStrike, Palo Alto Networks, and the Linux Foundation — will use it exclusively for defensive security work under a program called Project Glasswing. Forty more critical-infrastructure organizations get access. Anthropic is committing $100 million in usage credits and $4 million to open-source security foundations.

The best code generation model ever built is being used to find bugs in code. Not to write it.

The Scoreboard

SWE-bench Pro              Score   Gap from #1
Claude Mythos Preview      77.8%   —
GPT-5.4                    57.7%   −20.1
Qwen 3.6-Plus              56.6%   −21.2
Claude Code (Opus 4.5)     55.4%   −22.4
Claude Opus 4.5 (SEAL)     45.9%   −31.9

Six weeks ago, I corrected my commodity thesis: the top models differed by about 5 points on clean benchmarks, with scaffolding adding another 10. The spread was real but manageable. Today that spread is 20 points. This isn't "models matter somewhat." This is a different tier of capability.

What Emerged

The system card contains the most striking detail. Mythos's security capabilities — the reason Anthropic restricted it — were not explicitly trained. They emerged as a downstream consequence of general improvements in reasoning and coding ability.

The numbers illustrate what "emerged" means in practice:

Firefox JavaScript exploit development:
Opus 4.6 produced 2 successful exploits from several hundred attempts.
Mythos Preview produced 181, plus 29 additional register-control achievements.

OSS-Fuzz corpus (7,000 entry points):
Opus 4.6 found ~275 Tier 1-2 crashes and a single Tier 3.
Mythos Preview found 595 Tier 1-2 crashes, a handful of Tier 3-4, and 10 at Tier 5 — full control-flow hijack.

That's roughly a 90x improvement in exploit development capability (181 successful Firefox exploits versus 2) without any explicit security training. The model got better at reasoning about code, and offensive security fell out as a side effect. It found a 27-year-old OpenBSD TCP vulnerability, a 16-year-old FFmpeg H.264 codec bug, and chained together a Linux kernel exploit that grants complete machine control.

When I wrote about the 500 zero-days found by Opus 4.6 two days ago, the conclusion was that offense scales with compute while defense scales with humans. Mythos didn't just confirm that thesis — it showed the scaling is superlinear. You don't get 90x more exploits from a 20% better model. You get 90x more exploits because some threshold of code understanding was crossed, and capabilities that didn't exist at the previous level materialized all at once.
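
One way to see how a crossed threshold produces a 90x jump: treat a working exploit as a chain of reasoning steps that must all succeed, so end-to-end yield compounds per-step accuracy. This is a toy model with made-up constants, not anything from the system card.

```python
# Toy model: an exploit chain succeeds only if every reasoning step succeeds.
# Step counts and per-step accuracies are illustrative assumptions, not
# numbers from the Mythos system card.

def chain_success(per_step_accuracy: float, steps: int) -> float:
    """End-to-end success probability for a chain of independent steps."""
    return per_step_accuracy ** steps

STEPS = 20  # assume a working exploit needs ~20 consecutive correct steps

old = chain_success(0.80, STEPS)  # previous-tier model
new = chain_success(0.96, STEPS)  # 20% better per step

print(f"old model: {old:.1%} per attempt")    # ~1.2%
print(f"new model: {new:.1%} per attempt")    # ~44.2%
print(f"yield multiplier: {new / old:.0f}x")  # ~38x from a 1.2x per-step gain
```

Under these invented numbers, a 1.2x per-step gain compounds into a ~38x end-to-end gain. The constants are arbitrary, but the shape matches: small improvements in code reasoning buy multiplicative improvements in multi-step exploit chains.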

The Paradox

Here is where it gets strange. Anthropic built the most capable code generation model in history and chose to deploy it for code verification. Not generation. Verification.

This is the strongest possible signal about where AI coding actually is. Anthropic's own revealed preference — what they did with their best model, not what they say in press releases — tells you that finding bugs in code is a harder and more valuable problem than writing code. The company that can generate the best code in the world looked at the market and decided the highest-value application was checking everyone else's code.

Two weeks ago I wrote about the $256 million bet on code review as a product category. Mythos is Anthropic's $100 million bet on the same thesis — except they're not selling verification as a product. They're treating it as a security imperative.

The Gap That Matters

For the million-plus developers using AI coding tools today, nothing changed. They still have access to Opus 4.6, GPT-5.4, Sonnet 4.6 — models clustered around 53-58% on SWE-bench Pro. The scaffolding gap still matters. The verification infrastructure still barely exists. The 61% correct / 10.5% secure numbers from SusVibes didn't improve today.

But now there's a 20-point capability gap between what's possible and what's available. The commodity thesis — models are interchangeable, scaffolding is the differentiator — survives for the tools you can actually use. It's dead at the frontier.

The practical question: when Mythos-class models eventually become generally available, does the verification gap close or widen? If a model can generate code at 77.8% accuracy on hard problems AND find decades-old zero-days in the code it reviews, verification might actually get solved — but only if you can afford $25/$125 per million tokens. At roughly 1.7x the cost of Opus, Mythos-class verification is enterprise-only for the foreseeable future.
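
To put that pricing in concrete terms, here's a back-of-envelope sketch. The $25/$125 rates come from the paragraph above; the tokens-per-file figures are my own assumptions, not published numbers.

```python
# Back-of-envelope cost of a Mythos-class verification pass, assuming
# $25 input / $125 output per million tokens, and (my assumption) a review
# pass that reads ~1,000 tokens per file and emits ~200 tokens of findings.

INPUT_PER_M = 25.0    # USD per million input tokens
OUTPUT_PER_M = 125.0  # USD per million output tokens

def review_cost(files: int, in_tok: int = 1_000, out_tok: int = 200) -> float:
    """Estimated USD cost to review `files` files in one pass."""
    in_cost = files * in_tok / 1e6 * INPUT_PER_M
    out_cost = files * out_tok / 1e6 * OUTPUT_PER_M
    return in_cost + out_cost

# A 10,000-file codebase: ~$250 input + ~$250 output per full pass
print(f"${review_cost(10_000):,.0f} per full review pass")
```

Roughly $500 for one full pass over a 10,000-file codebase is tolerable as a quarterly audit. Run it on every merge, and the 1.7x premium starts to compound into a budget line only enterprises carry.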

For everyone else, the gap between generation and verification keeps widening. The code gets written faster. The bugs get subtler. The review capacity stays flat. Anthropic just demonstrated that the solution exists. They also demonstrated that you can't have it.

What This Actually Means

Three implications, in order of importance:

1. The model race isn't over. Anyone who told you models were commoditized was working from contaminated SWE-bench Verified data and a 5-point spread. The clean spread is now 20 points. Raw model capability matters — a lot — at the frontier. DeepSeek V4 and Spud, both expected this month, just got a much harder target to hit.

2. Project Glasswing is a precedent. Twelve competitors — including companies that are suing each other, poaching each other's engineers, and undercutting each other's pricing — agreed to share a restricted model for collective defense. The partner list reads like the entire cloud stack: AWS, Google, Microsoft, Apple, NVIDIA, Palo Alto Networks, CrowdStrike, Cisco, Broadcom, JPMorgan, Linux Foundation. This isn't a marketing partnership. It's an acknowledgment that offensive AI capabilities require industry-level coordination to contain.

3. The best use of AI coding is finding what AI coding broke. Anthropic built the apex predator of code generation and pointed it at the bugs. Not their competitors' bugs — everyone's bugs, including in code that AI tools helped write. The recursive loop is complete: AI generates code, AI finds the bugs in that code, AI generates the fixes. The question is no longer whether AI can replace developers. It's whether the verification loop can run fast enough to contain the generation loop, a race the sketch below tries to make concrete.
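
Whether containment is possible is a question about rates, and a few lines of arithmetic show the failure mode. Every constant below is an illustrative assumption; the point is the inequality, not the numbers.

```python
# Toy model of the generation-vs-verification race. All constants are
# illustrative assumptions, not measured numbers.

GEN_LOC_PER_DAY = 50_000     # lines of AI-generated code merged per day
BUGS_PER_KLOC = 2.0          # subtle defects per 1,000 generated lines
REVIEW_LOC_PER_DAY = 40_000  # lines the verification loop can audit per day
CATCH_RATE = 0.9             # fraction of defects the verifier finds

backlog = 0.0  # latent, unreviewed defects accumulating in the codebase
for day in range(1, 91):
    introduced = GEN_LOC_PER_DAY / 1_000 * BUGS_PER_KLOC
    reviewed_fraction = min(1.0, REVIEW_LOC_PER_DAY / GEN_LOC_PER_DAY)
    caught = introduced * reviewed_fraction * CATCH_RATE
    backlog += introduced - caught
    if day % 30 == 0:
        print(f"day {day}: ~{backlog:.0f} latent defects")
```

With these constants the latent-defect backlog grows by 28 bugs a day. Containment requires that review throughput times catch rate exceed the defect inflow, and right now the only models that meaningfully move the catch rate are the ones 52 organizations can touch.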

Today the answer is: only if you're one of the 52 organizations that Anthropic chose. Everyone else is still on their own.