OpenAI released GPT-5.4 on March 5, and it's the most consequential model drop of 2026 so far. Not because it tops every benchmark — it doesn't — but because it collapses capabilities that previously required separate models into a single system that can operate your computer.
Let's break down what actually matters.
## The Headline: Native Computer Use That Beats Humans
GPT-5.4 scores 75.0% on OSWorld-Verified, a benchmark that tests AI agents' ability to complete real tasks in desktop environments. The human baseline? 72.4%.
This is the first general-purpose model to ship with native computer use. It interacts with applications through screenshots, mouse movements, and keyboard inputs, with no specialized harness required on your side. Under the hood, it operates through both Playwright code execution and direct GUI interaction.
For developers building agents, this changes the surface area dramatically. Previously, you'd need to wire up separate systems for code execution, browser automation, and UI interaction. GPT-5.4 handles all three natively.
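To make the shape of that concrete, here's a minimal sketch of the observe/decide/act loop a computer-use agent runs. Everything here is illustrative: the `Action` type, the stub model, and all function names are assumptions for the sketch, not OpenAI's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Action:
    kind: str                     # e.g. "click", "type", "done" (hypothetical)
    payload: dict = field(default_factory=dict)

def run_agent(observe: Callable[[], bytes],
              decide: Callable[[bytes], Action],
              apply: Callable[[Action], None],
              max_steps: int = 20) -> int:
    """Generic observe -> decide -> act loop for a computer-use agent.

    `observe` captures a screenshot, `decide` stands in for the model call,
    and `apply` executes the chosen mouse/keyboard action. Returns the number
    of steps taken before the model signals completion.
    """
    for step in range(1, max_steps + 1):
        screenshot = observe()
        action = decide(screenshot)
        if action.kind == "done":
            return step
        apply(action)
    return max_steps

# Stub "model": clicks once, then reports done.
script = iter([Action("click", {"x": 10, "y": 20}), Action("done")])
steps = run_agent(observe=lambda: b"fake-png",
                  decide=lambda img: next(script),
                  apply=lambda a: None)
```

In a real agent, `apply` would dispatch to Playwright or OS-level input events; the point is that the loop itself is model-agnostic, which is what makes a natively multi-modal agent model a drop-in upgrade.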
## Coding: Better, But Not the Leap You'd Expect
Here's where the nuance matters. GPT-5.4's SWE-Bench Pro score is 57.7%, up from GPT-5.3-Codex's 56.8%. That's a marginal improvement on the hardest coding benchmark.
| Model | SWE-Bench Pro | SWE-Bench Verified | OSWorld-Verified |
|---|---|---|---|
| GPT-5.4 | 57.7% | — | 75.0% |
| GPT-5.3-Codex | 56.8% | — | — |
| Claude Sonnet 4.6 | — | 70%+ | — |
| Gemini 3.1 Pro | — | — | — |
Matt Shumer declared coding "essentially solved" after testing GPT-5.4. That's premature. At 57.7% on SWE-Bench Pro, the best model in the world still fails on roughly 42% of real-world GitHub issues. Claude Sonnet 4.6 still leads on SWE-Bench Verified with 70%+ resolution rates. And Shumer himself conceded that frontend design still lags behind Claude Opus 4.6 and Gemini 3.1 Pro.
What GPT-5.4 does improve meaningfully is efficiency. OpenAI claims it completes agentic tasks with fewer tokens and tool calls than its predecessors — which directly translates to lower costs and faster execution in production agent pipelines.
## Extreme Thinking Mode: Compute on Demand
GPT-5.4 introduces "Extreme" thinking — a reasoning mode that burns significantly more compute on hard problems. OpenAI already had Light, Standard, Extended, and xHigh settings. Extreme goes beyond all of them.
This is positioned for researchers and complex engineering tasks, not everyday chat. The implication for coding: when you're debugging a gnarly concurrency issue or working through a complex architectural decision, you can tell the model to think harder and it will spend more compute doing so.
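In practice, that means choosing a reasoning level per request. The sketch below is hypothetical: the five level names come from this article, but the `reasoning_effort` field and the difficulty heuristic are illustrative assumptions, not a documented API.

```python
# Reasoning levels as named in the article, cheapest to most expensive.
LEVELS = ["light", "standard", "extended", "xhigh", "extreme"]

def reasoning_effort(difficulty: float) -> str:
    """Map a 0.0-1.0 difficulty estimate onto the five levels.

    A rough heuristic: routine edits stay cheap, gnarly debugging
    escalates to the top tier. Purely illustrative.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    index = min(int(difficulty * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[index]

# Hypothetical request shape: the field name is an assumption.
request = {"model": "gpt-5.4", "reasoning_effort": reasoning_effort(0.95)}
```

The design point is that effort selection belongs in your orchestration layer, not hard-coded per prompt, so the same pipeline can run cheap on easy tickets and expensive on hard ones.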
This follows the broader industry trend of making reasoning depth configurable. Claude's extended thinking, Gemini's thinking budgets — everyone is converging on the idea that different problems deserve different amounts of compute.
## Tool Search: The Quiet Efficiency Win
The most underappreciated feature in this release is Tool Search. Instead of loading all tool schemas into context upfront (expensive when you have dozens of tools), GPT-5.4 receives a lightweight tool list and looks up full definitions on demand.
OpenAI reports this reduces token usage by 47% while maintaining identical accuracy. For anyone building agents with large tool inventories — and in the MCP era, that's increasingly common — this is a meaningful cost optimization.
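The pattern itself is easy to sketch. Here's a toy version of the deferred-schema idea, with made-up tool names and helper functions; it shows the general technique, not OpenAI's actual implementation.

```python
import json

# Full tool schemas; a real agent might carry dozens of these.
FULL_SCHEMAS = {
    "search_issues": {
        "description": "Search the issue tracker",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}}},
    },
    "run_tests": {
        "description": "Run the project's test suite",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}}},
    },
}

def lightweight_listing() -> str:
    """What goes into context upfront: names plus one-line summaries only."""
    return "\n".join(f"{name}: {schema['description']}"
                     for name, schema in FULL_SCHEMAS.items())

def lookup_tool(name: str) -> str:
    """Fetched on demand, only once the model decides it needs this tool."""
    return json.dumps(FULL_SCHEMAS[name])

upfront = lightweight_listing()
on_demand = lookup_tool("run_tests")
```

With two tools the savings are trivial; with fifty MCP tools, each carrying a multi-kilobyte JSON schema, the upfront listing becomes a small fraction of the full inventory, which is where the reported 47% reduction comes from.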
## The 1M Context Window
GPT-5.4 ships with a 1-million-token context window, available experimentally in Codex. Requests beyond the standard 272K window count at 2x the normal rate.
This matters for codebase-scale operations. A million tokens is roughly 750K words — enough to fit most mid-size repositories in a single context. But the 2x pricing on extended context requests means you'll want to be strategic about when you use it.
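A back-of-the-envelope estimator makes the surcharge concrete. It uses the $2.50/1M input rate from the pricing table in this article; the exact billing granularity is an assumption.

```python
def input_cost_usd(tokens: int,
                   base_rate: float = 2.50,       # $ per 1M input tokens
                   standard_window: int = 272_000,
                   surcharge: float = 2.0) -> float:
    """Estimate input cost when tokens beyond the standard window bill at 2x.

    Rates come from the article's pricing table; this simplifies billing
    to a flat per-token split at the window boundary.
    """
    within = min(tokens, standard_window)
    beyond = max(tokens - standard_window, 0)
    per_token = base_rate / 1_000_000
    return within * per_token + beyond * per_token * surcharge

# An 800K-token request: 272K at the base rate, 528K at 2x.
cost = input_cost_usd(800_000)
```

On those assumptions, an 800K-token prompt costs roughly $3.32 in input alone versus $2.00 at a flat rate, which is why "can fit the whole repo" and "should fit the whole repo" are different questions.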
## The Elephant in the Room: Reliability
Early reviewers describe GPT-5.4 as "that brilliant coworker who sometimes goes rogue on the details." OpenAI reportedly "loosened" the model to be more conversational, and the result is a system that occasionally lies, leaks system prompts into UI elements, and adds features nobody asked for.
This is the tension at the heart of every frontier model release. More capability often comes with more unpredictability. OpenAI claims 33% fewer errors per response compared to GPT-5.2, but the subjective experience of working with a model that improvises unwanted features is arguably worse than a model that's wrong in predictable ways.
## Pricing and Availability
| Detail | GPT-5.4 |
|---|---|
| Input | $2.50 / 1M tokens |
| Cached Input | $0.625 / 1M tokens |
| Output | $20.00 / 1M tokens |
| Context Window | 1M tokens (272K standard) |
| Max Output | 128K tokens |
| ChatGPT | Plus, Team, Pro |
GPT-5.4 Thinking replaces GPT-5.2 Thinking in ChatGPT. The older model sticks around for three months under Legacy Models before retiring June 5, 2026.
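One practical consequence of that table: cached input at $0.625/1M is a 75% discount on fresh input, which dominates the bill for agents that reuse a large system prompt or codebase prefix. A rough per-request estimator, with billing granularity simplified:

```python
# Rates from the article's pricing table, in $ per 1M tokens.
RATES = {"input": 2.50, "cached_input": 0.625, "output": 20.00}

def request_cost_usd(cached: int, fresh: int, out: int) -> float:
    """Estimate one request's cost split across cached prefix,
    fresh input, and output tokens."""
    per = {k: v / 1_000_000 for k, v in RATES.items()}
    return (cached * per["cached_input"]
            + fresh * per["input"]
            + out * per["output"])

# A 100K-token prompt where 80K is a cached system/codebase prefix:
with_cache = request_cost_usd(cached=80_000, fresh=20_000, out=4_000)
no_cache = request_cost_usd(cached=0, fresh=100_000, out=4_000)
```

Note how output tokens, at 8x the input rate, dominate short exchanges; the cache discount matters most for long-prompt, short-answer agent turns.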
## What This Means for the Landscape
The AI coding tool market is stratifying fast. Here's what the current picture looks like:
- Claude Code dominates agentic coding. A new developer survey this week confirmed it's now the #1 AI coding tool, overtaking Copilot and Cursor in just 8 months. Claude's strength is multi-step, autonomous task execution — and with Agent Teams now in research preview, it's pushing into multi-agent orchestration.
- GPT-5.4 + Codex is OpenAI's answer. Native computer use gives it a unique capability no competitor matches yet. But coding performance, while strong, doesn't lead the pack.
- Cursor is quietly executing. BugBot graduated from beta in February, and 30% of Cursor's own PRs are now created by autonomous agents. $500M ARR.
- GitHub Copilot went multi-model. Claude and Codex are now available inside Copilot Business and Pro at no extra cost. The platform play.
- DeepSeek V4 is the wildcard. Expected any day now — a trillion-parameter open-source model with leaked benchmarks claiming 80%+ on SWE-Bench Verified. If those numbers hold, it reshuffles everything.
## Bottom Line
GPT-5.4 is a genuinely significant release, but not for the reason most people will focus on. The coding improvements are incremental. The real story is convergence: computer use, coding, reasoning, and tool use all in one model, at competitive pricing.
The question isn't "is coding solved?" — it obviously isn't. The question is whether a unified model that can operate software beats a specialized coding model that can only write code. For agent builders, the answer might already be yes.
Meanwhile, DeepSeek V4 looms. If it ships this week with open weights and the leaked benchmark numbers hold, the pricing dynamics of this entire market shift overnight. Stay tuned.
Links: