Google, Anthropic, OpenAI, and Cursor all spent 2026 advertising how much code their AI writes. David Curlewis makes a sharp argument: every one of those numbers is a volume claim dressed up as an outcome, and the actual productivity research is far messier than the billboards suggest.

The software industry spent roughly two decades learning that lines of code is a terrible way to measure a developer. Counting commits, counting PRs, counting churned lines: all of it got laughed out of serious engineering orgs because volume tells you nothing about value. A senior developer who writes 40% more code than a peer is not automatically better, more impactful, or worth keeping over the other one. You want to know what shipped, what it did for customers, and whether it broke anything.
So it is a little strange that the headline AI engineering metric of 2026 is, functionally, lines of code again.
David Curlewis lays this out in a post worth reading in full, and the core observation is hard to argue with once you see it. Look at what the major AI vendors have been putting on their billboards this year:
- Google: roughly 75% of new code is AI-generated.
- Anthropic: about 80% of merged production code is written by Claude, with engineers shipping "8x more code per quarter."
- OpenAI: also around 80%.
- Cursor: "100M+ lines of enterprise code written per day."
Every single one is a volume claim. Percent of code written by AI is lines of code with a better publicist. And it is no accident that the companies making these claims are the companies selling the tools. Pumping adoption numbers is a commercial necessity for them, not a neutral observation about engineering productivity.
The claims used to be about outcomes
Rewind a few years and the flagship number was different in kind, not just in size. GitHub's big claim for Copilot was that developers completed a task 55% faster. You can pick apart the study design, and plenty of people did, but it was an outcome claim. It was bold, it was falsifiable, and it was about value. If it was wrong, you could go and demonstrate that it was wrong.
The 2026 claims have a different property: they cannot fail. "75% of our code is AI-written" can be true, and the number will keep climbing, regardless of whether anything actually got better. Faster delivery, fewer incidents, happier customers, lower defect rates: none of that is required for the volume number to go up. A volume metric can only disappoint you if adoption stalls, and adoption is the one thing nearly everyone agrees is genuinely happening.
That is the quiet trick. The claims got bigger and started saying less.
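To make the "cannot fail" property concrete, here is a minimal sketch. The `QuarterReport` type, its fields, and every number in it are hypothetical and invented purely for illustration; none of them come from any vendor or study. The share calculation never reads the outcome fields, so two teams with opposite results report the same headline number.

```python
from dataclasses import dataclass


@dataclass
class QuarterReport:
    """Hypothetical quarterly summary; every field is an illustrative assumption."""
    ai_lines: int                  # lines attributed to AI tools (volume)
    total_lines: int               # all new lines merged (volume)
    incidents: int                 # production incidents (outcome)
    median_lead_time_days: float   # commit-to-production time (outcome)


def ai_code_share(report: QuarterReport) -> float:
    """Volume metric: depends only on line counts, never on outcomes."""
    return report.ai_lines / report.total_lines


# Two made-up teams with identical volume and opposite outcomes.
improving = QuarterReport(ai_lines=75_000, total_lines=100_000,
                          incidents=3, median_lead_time_days=2.0)
degrading = QuarterReport(ai_lines=75_000, total_lines=100_000,
                          incidents=12, median_lead_time_days=9.0)

print(ai_code_share(improving), ai_code_share(degrading))
```

Both calls return the same share, because nothing in the calculation can see incidents or lead time. That is the sense in which the metric cannot disappoint unless adoption itself stalls.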
The evidence that does not fit on a billboard
What happened in between the outcome era and the volume era is that the outcome evidence got complicated.
The strongest pro-adoption result is still Cui et al.: nearly 5,000 developers, a 26% increase in completed tasks, with the largest gains going to junior developers. That finding is not really in dispute, and it is a real, useful signal.
Then it gets murkier. GitClear reported code churn rising and refactoring activity collapsing as Copilot adoption deepened, which is exactly the pattern you would expect if people are generating more code than they are carefully integrating. Then METR ran the study that skeptics love to quote: experienced open-source developers were 19% slower with AI in codebases they knew well, while believing they were 20% faster. That gap between perceived and measured speed is the genuinely interesting part.
But the story did not stop there. In February 2026 METR effectively walked the result back. Their follow-up estimates flipped toward a speedup, with error bars wide enough to drive a fully loaded motorcycle through, and they abandoned the original study design entirely. The reason is itself a finding: developers now refuse to work without AI, and they cannot reliably self-report time spent on agentic work. METR's current position is that AI probably speeds developers up in 2026, and that we can no longer cleanly measure by how much.
At the organizational level, an NBER survey of around 6,000 executives found 69% of firms actively using AI and roughly nine in ten reporting no measurable productivity impact. The rough cross-study consensus lands somewhere near 10% organizational gains. That is not nothing, and it is genuinely useful. It is also a long way from "you no longer need developers."
The honest reading cuts both ways. If you are still quoting "19% slower" as a gotcha, you are cherry-picking just as hard as the vendor quoting 8x. The research keeps updating. The industry just quietly changed what it counts.
Vanity metrics, now in AI flavor

This is not only a vendor problem. A whole genre of maturity frameworks has sprung up to formalize the substitution. Carnegie Mellon's SEI and Accenture recently launched an AI Adoption Maturity Model: five levels, eight dimensions, marketed off a statistic about 95% of organizations seeing no returns. Steve Yegge's "8 levels of AI-assisted development" ranks you by which tools you run and how much supervision you hand them. And nearly every tools vendor now ships a maturity ladder whose top rung is, conveniently, "use more of our product."
These ladders measure adoption intensity and label it maturity. Same substitution, nicer packaging.
The data point that captures the confusion best: Augment surveyed 219 engineering leaders and asked them to define "AI-native engineering." They got 219 different answers.
And the award for holding both ends of the rope goes to Anthropic, which produced both the "8x more code shipped" claim and one of the more rigorous studies of the year: a randomized controlled trial finding that AI-assisted developers scored 17% lower on comprehension of the code they had just shipped, with no statistically significant productivity gain. The point is not that Anthropic is being dishonest. The products are excellent and plenty of people, Curlewis included, use Claude every day. The point is that their research arm updates its understanding while their marketing arm counts volume, and both can be true at the same time.
Why any of this matters
These numbers are not decorative. They move budgets, reset performance expectations, and drive headcount decisions.
In February, Jack Dorsey cut over 40% of Block's workforce, more than 4,000 people, with AI as the explicit thesis: a significantly smaller team using the tools they are building can do more and do it better. A couple of weeks later, Atlassian cut about 10%, roughly 1,600 people, while conceding it would be disingenuous to pretend AI does not change the mix of skills and the number of roles required.
The detail that should give everyone pause: Dorsey said in the same announcement that the business was strong and gross profit was growing. When a company claims AI made everyone more productive and therefore it needs fewer people, the obvious question is where the evidence is. Show that some measurable share of the workforce is genuinely idle or underutilized because the work can now be done by fewer hands. And even then, name a product company that ever ran out of roadmap. If AI handed you a free productivity increase overnight, the natural move would be to ship more value to customers faster, which should show up in monthly active users, conversion, and revenue. Choosing layoffs instead suggests the productivity claim is doing PR work for a decision that was already made for other reasons, whether over-hiring or investor pressure.
Efficiency-driven trimming is a real thing that legitimately happens at every step change in this industry. The argument is narrower and more practical: when you do it, use the individual performance systems you already run, the ones that surface who is cruising and who is disengaged. Not token counts. Not percent of code written by AI. Not someone's rung on a maturity ladder. If your selection evidence is a vanity metric, your selection is a lottery wearing lipstick.
Where this actually lands
None of this is an anti-AI argument, and it should not be read as one. Every engineer should be using AI daily. The industry has absorbed higher-level languages, IDEs, autocomplete, agile, and DevOps, and each time there were holdouts reminiscing about the good old days before the new thing ruined everything. The holdouts eventually came around. The difference this time is pace. You could delay adopting the cloud for a couple of years and survive. With AI the grace period feels closer to a few months. The way the work gets done has already changed, and there is no sign of it changing back.
But adoption is the starting line, not the scoreboard. The industry already knows how to measure whether engineering is delivering: DORA metrics, reliability, the rate of meaningful change, and ultimately revenue and customer value. That toolkit is battle-tested and boring, which is exactly why it works. Trading it for token counts and AI vanity scores is a downgrade dressed as progress.
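For readers who have not computed these before, here is a rough sketch of the four DORA measures (deployment frequency, lead time for changes, change failure rate, and time to restore) derived from deployment records. The `Deployment` record, the `dora_summary` helper, and the sample data are all assumptions made up for illustration; a real pipeline would pull these timestamps from its own CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical deployment record; field names are illustrative."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only for failed deploys


def dora_summary(deployments: list[Deployment], window_days: int) -> dict:
    """Summarize the four DORA-style measures over a reporting window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600
        for d in deployments
    ]
    failures = [d for d in deployments if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        "median_restore_hours": median(restore_hours) if restore_hours else None,
    }


# Made-up sample week: four deployments, one of which failed.
t0 = datetime(2026, 1, 5, 9, 0)
day, hour = timedelta(days=1), timedelta(hours=1)
sample = [
    Deployment(t0, t0 + 4 * hour),
    Deployment(t0 + day, t0 + day + 8 * hour, failed=True,
               restored_at=t0 + day + 10 * hour),
    Deployment(t0 + 2 * day, t0 + 2 * day + 6 * hour),
    Deployment(t0 + 3 * day, t0 + 3 * day + 2 * hour),
]
summary = dora_summary(sample, window_days=7)
print(summary)
```

Nothing in this calculation asks who or what wrote the code. If AI tooling is helping, it should show up here as shorter lead times or more frequent deployments without a rising failure rate.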
The question to carry into your next vendor pitch, executive review, or LinkedIn scroll is short: is that an outcome, or a volume? It is remarkable how fast a confident claim deflates once you ask it. The change is real and the tools are good. The hopeful part is that we already know how to measure what matters, and none of it is counted in tokens. Be AI-first in how you work, and battle-tested in how you measure it.
