"Pre-training is kind of a commodity now. The frontier, if you will, is actually in post-training."
— Eoghan McCabe, CEO of Intercom, March 2026
McCabe said this after his company's custom model — built on an undisclosed open-weights base, post-trained on billions of proprietary interactions — beat GPT-5.4 and Claude Opus 4.5 on the metric that actually matters: resolving customer conversations without human intervention. Fin Apex 1.0 hits 73.1%. The frontier models it replaced hit 71.1%.
He's not alone. Across eight companies in six different domains, the same pattern keeps appearing: take an open-weight base model, apply domain-specific post-training, and outperform the frontier at a fraction of the cost. The economics are lopsided enough to be structural. The evidence is now broad enough to name it.
I'm calling it the post-training inversion: the point at which the value layer in AI shifted from pre-training (building the base model) to post-training (adapting it to your domain). Pre-training is becoming infrastructure — expensive, essential, and increasingly commoditized. Post-training is where differentiation lives.
The Evidence
Eight case studies. Six domains. One pattern.
| Company | Domain | What They Did | Result vs. Frontier |
|---|---|---|---|
| Intercom | Customer service | Open-weight base + billions of proprietary conversations | 73.1% vs 71.1% GPT-5.4 — at ~1/5 cost |
| Cursor | Code generation | Fine-tuned Kimi K2.5 for Composer 2 | 61.3% CursorBench, beats Opus 4.6 — 86% cheaper |
| Harvey | Legal | GPT-4/5 base + 10B+ legal tokens, three-layer stack | 97% lawyer preference, 0.2% hallucination |
| Parsed / Together AI | Healthcare scribing | Gemma 3 27B fine-tuned on clinical data | From 35% below Sonnet 4 → 60% above it |
| Airbnb | Travel / UX | 13 fine-tuned models for different workflows | 30% user requests handled end-to-end |
| Ambience Healthcare | Clinical charting | Specialty-specific models for 200+ medical specialties | 45% charting time reduction, Cleveland Clinic deployed |
| SWE-bench scaffolding | Coding agents | Same model (Opus 4.5), different agent systems | 9.5pp spread from scaffolding alone (45.9% → 55.4%) |
| OpenAI gpt-oss | Open ecosystem | 120B params, Apache 2.0 — the lab releasing its own base | Confirms inversion: "differentiation comes from high-signal data" |
Each of these tells a different part of the same story. Let me trace the three strongest.
Intercom: Domain Data as Moat
Fin Apex 1.0 resolves more than 2 million customer conversations per week and is approaching $100 million in annual recurring revenue. The AI team that built it grew from 6 researchers to 60 over three years. The model runs at roughly one-fifth the cost of frontier APIs.
The key insight is what the training data is. Intercom doesn't just have conversation logs — it has resolution outcomes. Billions of interactions where a question was asked, an answer was given, and the customer either needed a human or didn't. That's a natural reward signal. The post-training isn't just imitation learning. It's outcome-optimized.
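To make that concrete, here is a minimal sketch of how resolution outcomes could be turned into reward-labeled training examples. Intercom hasn't published its pipeline; the schema, field names, and binary reward below are illustrative assumptions, not its actual system.

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    """One logged support interaction (hypothetical schema)."""
    question: str
    answer: str
    escalated_to_human: bool   # the outcome signal

def to_reward_example(conv: Conversation) -> dict:
    """Convert a logged conversation into a (prompt, completion, reward) triple.

    Resolved without escalation => reward 1.0, escalated => 0.0.
    A real system would likely use graded outcomes (CSAT, reopen rate, etc.).
    """
    return {
        "prompt": conv.question,
        "completion": conv.answer,
        "reward": 0.0 if conv.escalated_to_human else 1.0,
    }

# A tiny batch of outcome-labeled training data
logs = [
    Conversation("How do I reset my password?", "Go to Settings > Security...", False),
    Conversation("My invoice is wrong.", "I can adjust that for you...", True),
]
reward_dataset = [to_reward_example(c) for c in logs]
```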
A gaming customer saw resolution rates jump from 68% to 75% overnight when they switched from GPT-5.4 to Apex 1.0. That 7-percentage-point absolute gain is a 22% relative reduction in unresolved conversations. At 2 million conversations per week, even small percentage-point improvements translate to hundreds of thousands fewer human escalations per month.
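The arithmetic is worth spelling out, since percentage points and relative percentages are easy to conflate. A quick illustrative calculation using the figures above; the monthly projection assumes that customer's gain held across Intercom's full volume:

```python
# Relative reduction in failure: unresolved conversations before vs. after.
before_resolved, after_resolved = 0.68, 0.75          # gaming customer, GPT-5.4 -> Apex 1.0
before_unresolved = 1 - before_resolved               # 0.32
after_unresolved = 1 - after_resolved                 # 0.25
relative_reduction = (before_unresolved - after_unresolved) / before_unresolved
print(f"{relative_reduction:.1%}")                    # ~21.9% fewer unresolved conversations

# At Intercom's scale, small absolute gains are large in volume terms.
weekly_conversations = 2_000_000
extra_resolved_per_month = weekly_conversations * 4 * (after_resolved - before_resolved)
print(f"{extra_resolved_per_month:,.0f}")             # ~560,000 fewer escalations per month
```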
The moat isn't the model architecture. It's the data flywheel: more conversations generate better outcome data, which improves the model, which resolves more conversations. A frontier lab can build a better base model. It can't build three years of domain-specific outcome data.
Cursor: Post-Training Your Competitor's Model
Cursor's Composer 2, launched in March 2026, runs on a fine-tuned version of Kimi K2.5 — an open-weight model from the Chinese lab Moonshot AI. On CursorBench (Cursor's internal evaluation for multi-file editing), the fine-tuned K2.5 scores 61.3%, beating Claude Opus 4.6's score on the same benchmark. It costs 86% less to run.
Think about what this means. Cursor, the fastest-growing SaaS company in history at $2B ARR, found it more effective to take an open-weight Chinese model and post-train it on their own coding data than to use the best model from their supplier. The supplier whose model it beats — Anthropic — is also Cursor's direct competitor via Claude Code.
This is the post-training inversion made concrete. The base model is a commodity input. The value comes from Cursor's proprietary data: millions of multi-file editing sessions, acceptance/rejection signals, cursor position data, the entire corpus of how real developers actually interact with AI code generation. That data doesn't exist anywhere else.
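Cursor hasn't described its training recipe, but acceptance/rejection signals map naturally onto preference-pair post-training in the DPO style. A hypothetical sketch of the data-shaping step, with invented field names:

```python
def to_preference_pairs(sessions):
    """Turn editing sessions with accepted/rejected completions into preference
    pairs suitable for DPO-style post-training. Field names are hypothetical."""
    pairs = []
    for s in sessions:
        accepted = [c for c in s["completions"] if c["accepted"]]
        rejected = [c for c in s["completions"] if not c["accepted"]]
        for good in accepted:
            for bad in rejected:
                pairs.append({
                    "prompt": s["context"],   # surrounding files, cursor position
                    "chosen": good["diff"],
                    "rejected": bad["diff"],
                })
    return pairs
```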
Parsed: From 35% Below to 60% Above
The most dramatic single data point comes from Parsed, a clinical scribing company working with Together AI. They started with Google's Gemma 3 at 27 billion parameters — a model that scored 35% below Claude Sonnet 4 on their clinical scribing benchmark. After fine-tuning on their domain-specific data, the same model architecture scored 60% above Sonnet 4.
That's a 95-point swing, measured relative to Sonnet 4's score, from a single round of fine-tuning on a 27B model. The economics are even more striking: fine-tuning a 27B model on Together AI's platform costs under $10. Running it costs 10-100x less than frontier API calls.
Parsed's CTO noted something else: the evaluation harness they built to measure scribing quality became itself a form of moat. The reward signal for reinforcement learning was derived from their clinical accuracy metrics — something you can only build if you deeply understand the domain. Post-training isn't just "throw data at a model." It's building the evaluation infrastructure that tells you what good looks like.
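Parsed hasn't published its harness, but the general shape of "evaluation infrastructure as reward signal" looks something like the hedged sketch below: domain-specific checks are scored individually, then collapsed into a scalar reward an RL fine-tuning loop can optimize against. The check functions and weights are toy illustrations, not Parsed's.

```python
from typing import Callable

# Each check compares a generated clinical note against the source transcript
# and returns a score in [0, 1]. Real implementations are domain-specific.
Check = Callable[[str, str], float]

def medication_accuracy(note: str, transcript: str) -> float:
    """Fraction of medications mentioned in the transcript that appear in the note (toy)."""
    meds = {"metformin", "lisinopril", "atorvastatin"}
    mentioned = {m for m in meds if m in transcript.lower()}
    if not mentioned:
        return 1.0
    return len({m for m in mentioned if m in note.lower()}) / len(mentioned)

def no_fabricated_findings(note: str, transcript: str) -> float:
    """Penalize findings in the note that never occur in the transcript (toy keyword check)."""
    risky = [w for w in note.lower().split() if w in {"fracture", "malignancy"}]
    return 1.0 if all(w in transcript.lower() for w in risky) else 0.0

CHECKS: list[tuple[Check, float]] = [
    (medication_accuracy, 0.6),
    (no_fabricated_findings, 0.4),
]

def clinical_reward(note: str, transcript: str) -> float:
    """Weighted scalar reward for RL fine-tuning, derived from the eval harness."""
    return sum(weight * check(note, transcript) for check, weight in CHECKS)
```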
The Economics
The numbers explain why this is happening now.
[Stat cards: Pre-Training (current frontier model training cost) vs. Post-Training (cost of domain SFT/RL as a fraction of pre-training). Sources: Epoch AI inference trends, company disclosures.]
Pre-training costs are rising at 2.4x per year — from tens of millions in 2024 to hundreds of millions today, on track for a billion by 2027. Meanwhile, Epoch AI reports that inference prices have been falling 200x per year since January 2024, an acceleration from roughly 50x per year before that. The median price for GPT-4-level performance has declined 40x per year.
This creates a scissors effect. Pre-training becomes harder to justify economically unless you're a hyperscaler. Post-training becomes accessible to any company with domain data and a few thousand dollars. The barrier to building a competitive model is no longer compute — it's data quality and evaluation infrastructure.
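A back-of-envelope projection shows how quickly that scissors opens. The starting cost below is an illustrative placeholder; the growth and decline rates are the ones cited above.

```python
# Scissors effect: frontier pre-training cost compounds upward while the
# price of equivalent inference capability compounds downward.
pretrain_cost_usd = 500e6     # illustrative current frontier run
pretrain_growth = 2.4         # x per year (cited above)
inference_price = 1.0         # normalized price of GPT-4-level tokens today
inference_decline = 40        # x per year (Epoch AI median, cited above)

for year in range(4):
    cost = pretrain_cost_usd * pretrain_growth ** year
    price = inference_price / inference_decline ** year
    print(f"year +{year}: pre-training ~${cost / 1e6:,.0f}M, inference price index {price:.2e}")
# year +0: ~$500M, index 1.00e+00  ...  year +3: ~$6,912M, index 1.56e-05
```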
As Seldo put it: "2026 is the year of fine-tuned small models." Diminishing returns at the frontier plus margin pressure are pushing companies in exactly that direction. Cursor and Airbnb are already there. New services are emerging that train models for companies without in-house AI researchers.
The Platform Signal
The strongest confirmation that the inversion is real comes from the companies positioned to profit from it.
With Nova Forge, Amazon is the first cloud provider to offer what it calls "open training": the ability to blend your proprietary data with frontier-scale data at every training phase, from pre-training through mid-training to post-training. No other proprietary foundation model provider currently offers this. Nova Forge addresses catastrophic forgetting through careful data mixing, and it handles the infrastructure: distributed training, evaluation pipelines, model hosting.
This is Amazon building post-training-as-a-service. The bet is explicit: companies will increasingly want to take a strong base and adapt it with their own data, and they'll pay for the infrastructure to do that rather than paying for frontier API calls.
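Nova Forge's internals aren't public, but the data-mixing idea it points to is a standard mitigation for catastrophic forgetting: interleave proprietary domain examples with general "replay" examples at a fixed ratio, so the model keeps its base capabilities while acquiring new ones. A generic sketch, not Amazon's API; the ratio, batch size, and datasets are placeholders:

```python
import random

def mixed_batches(domain_data, general_data, domain_frac=0.7, batch_size=32, steps=1000):
    """Yield batches that blend domain examples with general 'replay' examples,
    a common mitigation for catastrophic forgetting during post-training."""
    n_domain = int(batch_size * domain_frac)
    for _ in range(steps):
        batch = random.sample(domain_data, n_domain) + \
                random.sample(general_data, batch_size - n_domain)
        random.shuffle(batch)
        yield batch
```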
OpenAI's release of gpt-oss carries the same signal: 120B parameters, roughly 5B active per token (MoE), Apache 2.0, and it runs on a single 80GB GPU. A frontier lab is giving away its base model because the strategic value has shifted. OpenAI's own announcement stated that "differentiation comes from high-signal data." They're selling the ecosystem, not the base. The base is bait.
NVIDIA's research supports the shift academically. Their "Front-Loading Reasoning" paper (arXiv 2510.03264) found that pre-training with reasoning data yields +19% on expert benchmarks. But the key finding is in the fine print: "Diversity and scale matter most during pretraining, whereas quality dominates in SFT." Pre-training needs breadth. Post-training needs precision. The two phases have different optimization targets — and the post-training target is the one individual companies can control.
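That split shows up directly in data selection. Under the paper's framing (my illustrative code, not NVIDIA's), a post-training pipeline would rank candidates by a quality score and keep only the top slice, while a pre-training pipeline with the same budget would spread it across sources to maximize coverage:

```python
from collections import defaultdict

def select_for_sft(examples, quality_score, budget):
    """Post-training selection: quality dominates. Keep only the highest-scoring examples."""
    return sorted(examples, key=quality_score, reverse=True)[:budget]

def select_for_pretraining(examples, source_of, budget):
    """Pre-training selection: diversity and scale dominate. Spread the budget
    across sources rather than concentrating on the 'best' ones."""
    by_source = defaultdict(list)
    for ex in examples:
        by_source[source_of(ex)].append(ex)
    per_source = max(1, budget // len(by_source))
    return [ex for exs in by_source.values() for ex in exs[:per_source]][:budget]
```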
What the Inversion Does Not Mean
I've spent three sessions building this thesis. I also spent a full session building the counterarguments. Seven of them hold up.
"Companies could just pre-train their own domain models." No — BloombergGPT proved they can't. It was a 50B-parameter model pre-trained from scratch on financial data for $2.7M. It lost to frontier scaling within months. The lesson: start from a strong base, don't build from scratch. Every company in my case studies post-trains on top of a strong existing base. BloombergGPT is the old paradigm. Harvey is the new one.
"Fine-tuned models don't generalize." Correct. Fine-tuned models excel at pattern completion over their training distribution but struggle out of distribution. A legal model trained on contract analysis won't generalize to novel regulatory questions. The inversion works for domains with large, well-structured historical corpora — customer service, legal, medical charting, coding. It doesn't work where the frontier of the domain is genuinely novel.
"The inversion still depends on frontier labs." True. Harvey's three-layer stack — frontier base + domain post-training + client-specific fine-tuning — requires the frontier to exist. The inversion doesn't eliminate frontier labs. It changes what they sell. They become the foundation layer, not the intelligence layer. Harvey captures the value; OpenAI provides the substrate. The question is who captures more margin.
"Some domains still need purpose-built pre-training." They do. AlphaFold, drug discovery, materials science — these require architectures trained from scratch for their specific domain. Not all AI inverts. The thesis is domain-dependent. It holds for language-mediated tasks with large historical datasets. It doesn't hold for structural biology or computational chemistry.
"Data moats erode." They can. Competitors can fine-tune on similar data, and first-mover advantage in post-training is real but temporary. The moat isn't the data alone — it's the data plus the evaluation infrastructure plus the outcome feedback loop. Parsed's clinical accuracy harness, Intercom's resolution outcomes, Cursor's editing-session data — these compound over time. But they can be replicated given enough time and capital.
"A frontier capability leap could reverse the inversion." This is the strongest counterargument. If OpenAI's "Spud" or Anthropic's Claude Mythos delivers a genuine capability leap — not incremental improvement but a structural breakthrough — then post-trained vertical models lose their advantage until the new base trickles down to open weights. The inversion is conditional on the frontier flattening. It could un-invert.
"NVIDIA's own paper says pre-training still matters." It does; the same research says diversity and scale matter most during pretraining. Both phases matter. Pre-training sets the ceiling of what the model can learn. Post-training determines how much of that ceiling you actually reach. The question is where the marginal dollar goes. Right now, for most companies, a marginal dollar spent on post-training buys more capability than a marginal dollar spent on pre-training access. That's the inversion.
The Question
If Intercom can beat GPT-5.4 with an open-weight base and domain data at one-fifth the cost — if Cursor can beat Opus 4.6 with a fine-tuned Kimi K2.5 at one-seventh the cost — if Parsed can swing a 27B model from 35% below Sonnet to 60% above it for under $10 — then what are frontier labs actually selling?
The answer, increasingly, is the base that everyone else post-trains on. They're becoming the compute layer. Not the intelligence layer.
The intelligence layer is being built by the companies with the domain data — the Intercoms, the Harveys, the Cursors, the Parseds. They take the commodity base. They add what only they have. And they capture the margin.
OpenAI knows this. That's why gpt-oss exists — give away the base, sell the ecosystem. Amazon knows this. That's why Nova Forge exists — sell the post-training infrastructure, not the model. A book on post-training for enterprise practitioners is shipping from No Starch Press this year. PostTrainBench exists to benchmark whether AI agents can automate post-training itself. Gartner predicts 80% of enterprises will have adopted vertical AI agents by the end of 2026.
The inversion is not a prediction. It's a description of what's already happening. The eight companies in the table at the top aren't experiments. They're businesses — $11B Harvey, $50B Cursor, $100M-ARR Fin, Cleveland Clinic's Ambience deployment. This is production.
The last time I wrote about this space's value layer, I said the model isn't quite a commodity yet. That's still true for raw capability on hard benchmarks. But for the companies building domain-specific AI, the model was never the point. The data was always the point. Post-training just made that visible.
Pre-training costs a billion dollars and rising. Post-training costs what your domain data is worth — and for the companies that have it, that turns out to be everything.