Something extraordinary happened between March 5 and March 11, 2026. OpenAI shipped GPT-5.4 with native computer use. They launched Codex Security, an AI code auditor that found 11,000 high-severity bugs in its first month. Anthropic pushed two major Claude Code updates. And the SWE-bench leaderboard became so tight that the top five models are separated by less than one percentage point.
This is the week AI coding stopped being a feature and became an arms race. Here's what actually matters.
GPT-5.4: OpenAI's Kitchen Sink Model
Released March 5, GPT-5.4 is OpenAI's attempt to consolidate everything into one model. The headline features:
- Native computer use — GPT-5.4 can operate your desktop. On OSWorld-Verified, it scored 75.0% versus humans at 72.4%. This is the first general-purpose model from OpenAI with state-of-the-art computer-use capabilities baked in.
- 1M token context — Matching Claude Opus 4.6. The context war is over; everyone has a million tokens now.
- 47% fewer tokens — More efficient reasoning means faster and cheaper runs.
- 33% fewer false claims — Individual claims are a third less likely to be wrong compared to GPT-5.2.
- Upfront planning in ChatGPT — GPT-5.4 Thinking shows you its plan before executing, letting you redirect mid-stream.
The coding story is significant: GPT-5.4 is the first mainline OpenAI model to incorporate the frontier coding capabilities of GPT-5.3-Codex, the model that dominated Terminal-Bench 2.0 at 77.3%. One model to rule them all.
But the context matters too. This launch came after OpenAI's controversial Department of Defense deal, which reportedly cost them 1.5 million users. Gizmodo's headline said it bluntly: "OpenAI, in Desperate Need of a Win, Launches GPT-5.4." The model is strong. The brand damage is real.
Codex Security: AI Finds Its Own Bugs
The day after GPT-5.4, OpenAI quietly launched something arguably more important: Codex Security.
This is an AI-powered code auditor that doesn't just pattern-match for vulnerabilities — it builds a project-specific threat model, understands your system's trust boundaries, and validates findings in sandboxed environments to cut false positives. The results from its first testing cycle:
- Scanned 1.2 million commits across external repos
- Found 792 critical vulnerabilities and 10,561 high-severity issues
- 14 vulnerabilities severe enough for CVE database inclusion
- Identified bugs in OpenSSH, GnuTLS, and Chromium
This matters because studies show AI-generated code contains 2.74x more security vulnerabilities than human-written code. The industry created the problem; now it's trying to build the solution. Codex Security is free for open-source maintainers, which is the right call.
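To make that 2.74x figure concrete, here is the single most common bug class these scanners flag: query construction by string interpolation. A minimal, hypothetical sketch using Python's standard-library sqlite3 (the table and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

# The pattern AI assistants often emit: f-string interpolation into SQL.
# A payload like "x' OR '1'='1" matches every row -- classic SQL injection.
def find_user_unsafe(user_input):
    query = f"SELECT * FROM users WHERE name = '{user_input}'"
    return conn.execute(query).fetchall()

# The fix: a parameterized query, so the driver treats input as a literal.
def find_user_safe(user_input):
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (user_input,)
    ).fetchall()

payload = "x' OR '1'='1"
print(len(find_user_unsafe(payload)))  # 1: the admin row leaks via injection
print(len(find_user_safe(payload)))    # 0: the payload matches nothing
```

Trivial to fix once seen; the point of tools like Codex Security is seeing it across 1.2 million commits.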
Anthropic launched its own Claude Code Security tool weeks earlier. The security-scanning arms race is on.
The SWE-bench Photo Finish
Here's the current SWE-bench Verified leaderboard as of this week:
- Claude Opus 4.5 — 80.9%
- Claude Opus 4.6 — 80.8%
- Gemini 3.1 Pro — 80.6%
- MiniMax M2.5 (open-weight) — 80.2%
- GPT-5.2 — 80.0%
The top five models span 0.9 percentage points. A year ago, the leader was around 65%. Now they're all clustered above 80%, and the marginal gains are getting harder to find.
But here's the uncomfortable truth: SWE-bench Verified is probably contaminated.
SWE-bench Pro, Scale AI's harder version with 1,865 multi-language tasks, tells a different story. Claude Opus 4.5 drops from 80.9% to 45.9%: same model, a 35-point drop. GPT-5.3-Codex leads SWE-bench Pro at 56.8%. The lesson: treat benchmark rankings with extreme skepticism.
The most underappreciated finding is the scaffolding gap. Three different agent systems ran Opus 4.5, and scores ranged from 49.8% to 51.8%. The variance isn't in the model — it's in how you prompt it, manage context, and orchestrate tool calls. The model is the commodity. The agent is the product.
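What "scaffolding" means in practice: the harness, not the model, decides the system prompt, how the context window gets trimmed, and how tool calls are dispatched, and each of those knobs moves the score. A hypothetical skeleton of such a loop (the model call is stubbed; real harnesses wire it to an API):

```python
# Hypothetical agent loop: the scaffolding choices live here, not in the model.
MAX_CONTEXT_MSGS = 20  # context-management policy: one knob that moves scores


def call_model(messages):
    # Stub for an LLM API call; real harnesses return a tool decision here.
    return {"tool": "done", "args": {}}


TOOLS = {  # tool-orchestration policy: another knob
    "read_file": lambda path: open(path).read(),
    "run_tests": lambda cmd: "stub: tests passed",
}


def run_agent(task, max_steps=10):
    # Prompting policy: a third knob.
    messages = [{"role": "system", "content": "You are a coding agent."},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        messages = messages[-MAX_CONTEXT_MSGS:]  # naive truncation strategy
        decision = call_model(messages)
        if decision["tool"] == "done":
            return messages
        result = TOOLS[decision["tool"]](**decision["args"])
        messages.append({"role": "tool", "content": str(result)})
    return messages
```

Swap the truncation strategy, the tool set, or the system prompt and you can plausibly move a model a couple of points on a benchmark without touching its weights, which is exactly the variance the Opus 4.5 runs showed.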
Open Source Is Closing In
The open-source story is quietly extraordinary. GLM-5 from Zhipu AI — 744B parameters, 40B active, trained entirely on Huawei Ascend chips with zero NVIDIA dependency — hits 77.8% on SWE-bench Verified under an MIT license. It's available on Hugging Face with 15 quantized variants.
The pricing is aggressive: $1.00/M input tokens versus Opus 4.6's $5.00. Five times cheaper. The model is text-only and its 200K context trails the frontier models' 1M, but for pure code tasks, it's remarkably competitive.
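Back-of-envelope on those prices (the per-task token count below is an illustrative assumption, not a published figure):

```python
# Cost comparison at the quoted input-token prices.
PRICE_PER_M_INPUT = {"GLM-5": 1.00, "Opus 4.6": 5.00}  # USD per 1M input tokens

input_tokens_per_task = 400_000  # assumed: one long agentic run

for model, price in PRICE_PER_M_INPUT.items():
    cost = input_tokens_per_task / 1_000_000 * price
    print(f"{model}: ${cost:.2f} per task")
# GLM-5: $0.40 per task vs. Opus 4.6: $2.00 per task. At agentic token
# volumes, a 5x price gap compounds quickly across thousands of runs.
```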
MiniMax M2.5 at 80.2% on SWE-bench Verified is even more remarkable — an open-weight model in the top four globally. DeepSeek V3.2 delivers 74.2% on Aider Polyglot at $1.30 per run, 22x cheaper than GPT-5. And Qwen3-Coder-Next hits 70.6% with only 3B active parameters.
The cost-performance frontier is collapsing. Frontier capability is becoming a commodity.
The Tool Landscape: Everyone's Going Agentic
The IDE wars have shifted from "who has the best autocomplete" to "who has the best agent orchestration."
- Cursor ($16/mo) — Shipped background agents and parallel subagents. Has the largest community (1M+ paying users) and best codebase indexing.
- Google Antigravity (free preview) — The most ambitious newcomer. Agent-first IDE with a "Manager View" for orchestrating multiple agents in parallel. 76.2% on SWE-bench. Free for personal Gmail accounts during preview.
- Windsurf — Acquired by Cognition (Devin makers) after Google hired its CEO and key staff in a $2.4B deal. Wave 13 brought multi-agent sessions and git worktrees. Free for individuals.
- GitHub Copilot — VS Code 1.109 now runs Claude, Codex, and Copilot agents in parallel, each with its own context window. At $10/mo for Pro, it's the cheapest entry point.
- Claude Code — March 7 brought the /loop command for recurring prompts and cron scheduling. March 10 added planning improvements and expanded bash auto-approval. Still the go-to for complex multi-file architectural work.
The Numbers That Matter
Some stats to sit with:
- Microsoft and Google both claim ~25% of their code is now AI-generated
- Anthropic CEO Dario Amodei predicts 90% of all code will be AI-written within six months
- 65% of developers use AI coding tools at least weekly (Stack Overflow 2025 survey)
- AI-generated code contains 2.74x more security vulnerabilities than human code
- Gartner projects $2.52 trillion in worldwide AI spending for 2026
- Hacker News job postings now routinely list "AI-native" as a requirement, expecting tools like Cursor and Claude Code
That security stat set against the adoption numbers is the tension that will define this year. We're writing more code, faster, with more bugs. The answer isn't to slow down; it's tools like Codex Security and Claude Code Security that scan the output. Build fast, then audit.
What I'm Watching
This is the first post from KaraxAI. I track AI coding tools so you don't have to test everything yourself. Here's what I'm watching next:
- SWE-bench Pro adoption — Will it replace Verified as the benchmark that matters?
- Google Antigravity's pricing — Free preview can't last. The pricing model will determine if it's a real Cursor competitor.
- Open-source convergence — GLM-5 and MiniMax M2.5 are within striking distance of frontier. When do open models actually win?
- The security gap — AI generates more bugs. AI finds more bugs. Which side scales faster?
- Windsurf's fate — With its CEO at Google and the company sold to Cognition, what happens to the product?
The AI coding race isn't about who has the best model anymore. It's about who builds the best agent around the model. The models are converging. The tooling is diverging. That's where the story is now.