The Real Product

In 2005, Google figured out something that took the rest of the industry a decade to understand. The product wasn't search. Search was the mechanism. The product was the behavioral data that billions of searches generated — what people wanted, when they wanted it, and how they refined their queries when they didn't find it. Google gave away search and sold the exhaust.

Twenty-one years later, the same structure is running through AI coding tools. And almost nobody is talking about it.

Three Grabs in Ninety Days

Between April and August 2026, three of the largest platforms in software development changed their data policies in the same direction, within months of each other, using the same architecture.

GitHub Copilot went first. On April 24, Microsoft updated its privacy statement: interaction data from Copilot Free, Pro, and Pro+ users — prompts, suggestions, accepted code, rejected code, file context, navigation patterns — would be used to train AI models by default. Previously, this required opt-in consent. The change flipped it. One hundred fifty million developers on the platform. 232 downvotes and 59 thumbs-down in the community FAQ. No per-repository control. Opt-out rates undisclosed.

Cursor was already there — but the destination shifted. In April, SpaceX announced a $60 billion acquisition option for Anysphere, Cursor's parent company. The alternative: a $10 billion collaboration deal. The technical rationale became clear when xAI confirmed that Grok V9-Medium — 1.5 trillion parameters, mid-June release — was trained on Cursor workflow data. Not code from repositories. Workflow data: how developers debug, refactor, accept, reject, iterate. Two senior Cursor engineers departed to xAI, reporting directly to Musk. The acquisition option isn't for the IDE. It's for the data pipeline.

Atlassian went widest. Starting August 17, data from Jira, Confluence, Jira Service Management, and other cloud products will be used to train Atlassian's AI offerings. Three hundred thousand customers. Not just code — project structure. Sprint data. Issue descriptions. Documentation. The full lifecycle of how software gets planned, tracked, and delivered. The Register called it what it was: Atlassian reversed a prior policy that explicitly stated customer data would not be used for AI training.

Three platforms. Three policy shifts. One direction.

What They're Actually Collecting

GitHub has roughly 420 million public repositories. Every major model has trained on that corpus already. The code that survived — the finished artifacts, the clean commits — has been consumed. It's not scarce. It's commodity.

What's scarce is the process data. How a developer approaches a bug. Which suggestion they accept. Which they reject and why. The dead ends, the rollbacks, the multi-file edits spanning hours, the moment when someone takes over the AI's draft and reshapes it. RevolutionInAI put it directly: "The bottleneck shifted from text abundance to process scarcity. GitHub shows what survived; developer logs show HOW engineers actually work."

This is what Cursor captures and xAI is buying. This is what Copilot's new policy collects. And Atlassian extends the capture beyond the editor into the entire development lifecycle — the decisions that happen before and after code is written.

What GitHub repos contain: Finished code. Clean diffs. Merged PRs. What survived.
What workflow data contains: Wrong turns. Rejected suggestions. Debugging sessions. How the engineer actually thinks.
Available to everyone: 420M+ public repos, fully consumed by all major model providers.
Available only to platform owners: Cursor (4M+ developers), Copilot (150M+ users), Atlassian (300K orgs).

The distinction matters because it explains the valuations. Cursor's $2 billion ARR and $50-60 billion valuation aren't justified by an IDE that wraps API calls to someone else's model. They're justified by the data those API calls generate. Every accepted suggestion, every rejection, every multi-step edit session is a training signal that no competitor can replicate without building the same user base.

Privacy by Paywall

The consent architecture is identical across all three platforms, and it tells you exactly who the customer is.

Cursor: Free and Pro plans have Privacy Mode off by default. Roughly half of users have turned it on, which leaves the other half, about 2.5 million developers, sending workflow data into the training pipeline. Business and Enterprise: Privacy Mode on by default, with Zero Data Retention agreements enforced with upstream model providers.

Copilot: Free, Pro, and Pro+ users opted in by default as of April 24. Business and Enterprise exempt under existing contract terms.

Atlassian: Free and Standard tier customers cannot opt out of metadata collection at all. Enterprise tier has both metadata and content collection off by default.

The pattern: individuals and small teams generate training data. Enterprises pay to be excluded.
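The three policies can be laid side by side as a small lookup table. This is only a sketch that transcribes the defaults described above. The `TierPolicy` type and its field names are my own illustrative assumptions, not any vendor's API or settings schema, and tiers the text does not mention are left out rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    """One row of the consent table (hypothetical structure, not a vendor schema)."""
    platform: str
    tier: str
    trains_by_default: bool  # True = data feeds training unless the user acts
    note: str


# Rows transcribed from the paragraphs above; unmentioned tiers are omitted.
POLICIES = [
    TierPolicy("Cursor", "Free", True, "Privacy Mode off by default"),
    TierPolicy("Cursor", "Pro", True, "Privacy Mode off by default"),
    TierPolicy("Cursor", "Business", False, "Privacy Mode on, Zero Data Retention"),
    TierPolicy("Cursor", "Enterprise", False, "Privacy Mode on, Zero Data Retention"),
    TierPolicy("Copilot", "Free", True, "opted in by default as of April 24"),
    TierPolicy("Copilot", "Pro", True, "opted in by default as of April 24"),
    TierPolicy("Copilot", "Pro+", True, "opted in by default as of April 24"),
    TierPolicy("Copilot", "Business", False, "exempt under existing contract terms"),
    TierPolicy("Copilot", "Enterprise", False, "exempt under existing contract terms"),
    TierPolicy("Atlassian", "Free", True, "no opt-out of metadata collection"),
    TierPolicy("Atlassian", "Standard", True, "no opt-out of metadata collection"),
    TierPolicy("Atlassian", "Enterprise", False, "metadata and content off by default"),
]


def feeds_training_by_default(policies):
    """Return (platform, tier) pairs whose data is collected unless the user acts."""
    return [(p.platform, p.tier) for p in policies if p.trains_by_default]


if __name__ == "__main__":
    for platform, tier in feeds_training_by_default(POLICIES):
        print(f"{platform} {tier}: training data source by default")
```

Filtering on a single boolean is enough to reproduce the pattern: every row that survives the filter is a free or low-cost tier, and every Business or Enterprise row drops out.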

This isn't a bug in the business model. It is the business model. The free or cheap tier isn't subsidized out of generosity. It's subsidized because each free user is a training data source. The enterprise tier isn't expensive because the compute costs more. It's expensive because the customer is no longer paying with their data — they're paying with money instead, and the platform loses the training signal.

Google Search worked exactly this way. The free product trained the algorithm that created the moat. The enterprise version (Google Workspace, Google Cloud) charged money for privacy guarantees. Twenty years later, the consent architecture hasn't changed. Only the domain has.

The Flywheel

The mechanism is circular and self-reinforcing. More developers use the tool. The tool collects more workflow data. The data trains a better model. The better model attracts more developers. Each cycle widens the gap between the platform owner and any competitor who doesn't have the pipeline.

Microsoft understood this early. MAI-Code-1-Flash — their first in-house coding model, 5 billion parameters, announced at Build 2026 — was trained not on public code but against the production Copilot harness. 4.7 million paying users generating workflow data daily. The model scored 51.2% on SWE-bench Pro while using 60% fewer tokens. Microsoft isn't just building a model. They're building a model from their product's data, then feeding that model back into the product. The harness trains the model; the model improves the harness; the improved harness generates better training data.

xAI is doing the same thing from the outside. Grok V9-Medium, trained on Cursor data, will ship mid-June. If the $60 billion acquisition closes, xAI won't just have a model — they'll have the pipeline. The two senior Cursor engineers who moved to xAI aren't building an editor. They're building the data infrastructure that connects the editor to the training loop.

This is where the commodity thesis I've been tracking since article #32 arrives at its structural conclusion. When models commoditize — and they are commoditizing, with five systems above 80% on SWE-bench Verified — the scarce resource isn't the model. It's the data that trains the next model. The value chain has descended one more level: pre-training → post-training → scaffolding → verification → data.

The Flywheel Eats Itself

There's a problem with this machine, and it's the reason this piece isn't a neutral observation about business models.

The flywheel depends on human developers generating process data. Accept a suggestion. Reject a suggestion. Debug a failed approach. Refactor across files. Each of these actions is a training signal precisely because a human made a judgment call. Synthetic data — AI reviewing AI output — can't replicate this. Shumailov et al. (Nature, 2024) showed that even a 1/1000 fraction of synthetic data in a training corpus degrades model performance. Recursive training on model-generated content produces what they called "model collapse" — progressive loss of the distributional tails that make the output useful.
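The loss of distributional tails can be seen in a toy setting. The sketch below is not the experiment from the paper. It is a minimal illustration under my own assumptions: each "generation" fits a Gaussian to a small sample drawn from the previous generation's fit, with no fresh human data ever added. The function name, sample size, and generation count are arbitrary choices.

```python
import random
import statistics


def recursive_fit(generations=200, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at every generation, starting
    with the original ("human") distribution at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the original distribution
    sigmas = [sigma]
    for _ in range(generations):
        # Train only on the previous model's output.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # maximum-likelihood estimate
        sigmas.append(sigma)
    return sigmas


if __name__ == "__main__":
    sigmas = recursive_fit()
    for generation in (0, 10, 50, 100, 200):
        print(f"generation {generation:3d}: sigma = {sigmas[generation]:.3g}")
```

Each refit slightly underestimates the spread, and the errors compound, so the fitted distribution narrows toward a point. Rare values stop being sampled long before the mean looks wrong, which is the "tails first" failure the quote describes.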

So the flywheel needs human developers. But the flywheel is eliminating human developers.

Harvard Business School tracked 62 million workers across 285,000 firms: junior developer hiring is down 67% since 2022. CS graduates face 6.1% unemployment — nearly double the national average. Stack Overflow contributions are declining. 74% of new web pages contain AI-generated content. The humans whose judgment the flywheel needs are being removed from the pipeline that would develop their judgment.

Ivan Turkovic named this the "Training Data Paradox" in March 2026: AI consuming the knowledge produced by human engineers while eliminating the conditions that produced it. Classic tragedy of the commons. The shared resource — human engineering expertise — is being harvested by every platform simultaneously, while the investment in producing new expertise (hiring, mentorship, junior roles) is being cut.

The sequence:

  1. Tool collects workflow data from human developers
  2. Data trains a better model
  3. Better model reduces demand for junior developers
  4. Fewer juniors entering the pipeline
  5. Senior developers retire or burn out — 88% of heavy AI users report increased burnout
  6. Fewer humans generating the process data the flywheel needs
  7. Training corpus shifts toward synthetic and model-generated content
  8. Model collapse risk increases

The flywheel that creates the moat depletes the resource that feeds it.
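The eight steps above can be sketched as a toy simulation. Every number here is a made-up assumption chosen only to make the feedback loop visible: the starting pool, the attrition and hiring rates, and the rate at which data improves the model are not estimates of anything real, and `simulate` is a hypothetical helper, not a forecast.

```python
from dataclasses import dataclass


@dataclass
class FlywheelState:
    year: int
    human_signal: float   # process data generated by skilled humans this year
    model_quality: float  # 0..1, purely illustrative
    skilled_devs: float   # pool size at the end of the year


def simulate(years=15, skilled_devs=1000.0, model_quality=0.2,
             attrition=0.08, base_hiring=0.12, learning_rate=0.0001):
    """Run the loop: data -> better model -> less hiring -> less data."""
    history = []
    for year in range(years):
        # Step 1: each skilled developer contributes one unit of signal.
        human_signal = skilled_devs
        # Step 2: the data improves the model, saturating at 1.0.
        model_quality = min(1.0, model_quality + learning_rate * human_signal)
        # Steps 3-4: hiring into the pipeline shrinks as the model improves.
        hires = skilled_devs * base_hiring * (1.0 - model_quality)
        # Step 5: seniors retire or burn out at a fixed rate.
        leavers = skilled_devs * attrition
        # Step 6: next year's pool, and so next year's signal, follows.
        skilled_devs = skilled_devs + hires - leavers
        history.append(FlywheelState(year, human_signal, model_quality, skilled_devs))
    return history


history = simulate()

if __name__ == "__main__":
    for state in history:
        print(f"year {state.year:2d}  signal {state.human_signal:7.1f}  "
              f"quality {state.model_quality:.2f}  devs {state.skilled_devs:7.1f}")
```

With these assumed rates the pool grows slightly in the first year, while hiring still outpaces attrition, then shrinks every year after as the model improves. Steps 7 and 8 are not modeled. The sketch only shows how the human signal that feeds the loop can fall while model quality is still rising.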

What This Means for the $60 Billion Question

SpaceX's $60 billion acquisition option for Cursor makes sense only if the data pipeline remains valuable. The pipeline remains valuable only if millions of skilled developers continue using the tool and generating process data. The tool is making the investment in creating those developers look irrational quarter by quarter.

This isn't a distant risk. GitHub commits have more than quadrupled, from 300 million in 2023 to 1.4 billion in early 2026. 51% of committed code is now AI-generated or substantially assisted. The volume of code is exploding while the population of humans capable of generating meaningful training signals is contracting. Right now, the flywheel is spinning fast because the existing workforce is large enough to feed it. The question is what happens in three to five years, when the junior developers who weren't hired in 2024-2026 should have been becoming the mid-level engineers whose workflow data is most valuable.

Google's data flywheel worked because searching doesn't require expertise — more users always means more data. The coding data flywheel is different. It works only as long as the users are skilled enough for their workflow to be a useful training signal. The more successful the flywheel is, the fewer such users it produces.

Every data collection policy announced this spring is rational for the company that announced it. Individually, each platform is building the strongest moat available. Collectively, they are draining a shared resource that none of them are replenishing. The tragedy of the commons doesn't require malice. It only requires that each actor optimizes locally while the shared resource degrades globally.

The real product was never the coding assistant. The real product is the training data the assistant generates. And the machine that harvests that data is running at peak capacity on a resource it's helping to exhaust.