A new analysis of AI progress trends argues that training-efficiency gains, flawed intelligence metrics, and non-intelligence model traits explain why longer training horizons haven't slowed development, countering common assumptions about compute scaling limits.
Dwarkesh Patel, a podcaster known for prepping extensively for technical interviews with AI researchers and engineers at dwarkeshpatel.com, recently launched a public call for answers to four core unresolved questions about AI development. Top respondents will be eligible for research collaborator roles, making the call part job interview, part community challenge. The first question he posed has sparked particular debate among developers and AI watchers: why hasn't AI progress slowed down, despite expectations that longer-horizon reinforcement learning training would require far more compute and time per iteration?
Sean Goedecke, a developer and commenter on AI engineering topics, published a detailed response to the question this week. His post breaks down the gap between the theoretical case for slowing progress and the observed acceleration in model capabilities, drawing on examples from frontier model training and recent benchmark failures. 
The logic behind the expectation of slower progress goes like this. Training models via reinforcement learning requires the model to complete a task, then receive a grade or reward signal based on the output. As models grow more capable, the tasks they train on get harder, which means they take more FLOPs to complete. FLOPs, or floating-point operations, are the basic unit of GPU work; in modern AI training and inference, the bulk of them go to matrix multiplications. More FLOPs per task mean more FLOPs per training run, which should mean longer training times and slower iteration. But the METR horizon-length graph, a widely cited benchmark that tracks the length of tasks (measured by how long they take human professionals) that models can reliably complete, shows the exact opposite. Models are handling increasingly complex, long-horizon tasks at an accelerating rate, with no sign of the predicted slowdown. The latest METR data and methodology are available at metr.org.
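As a rough illustration of why longer horizons should, in theory, cost more: a common back-of-the-envelope approximation puts a dense transformer's forward pass at about 2 × parameters × tokens FLOPs. The sketch below applies that rule to rollouts of increasing length; the parameter count is hypothetical, and attention's quadratic term is ignored.

```python
# Back-of-the-envelope FLOP cost of a single RL rollout, using the
# common ~2 * params * tokens forward-pass approximation.
# Assumptions: dense transformer, attention cost ignored, and a
# 1e12 parameter count chosen purely for illustration.
def rollout_flops(params: float, tokens: int) -> float:
    return 2 * params * tokens

PARAMS = 1e12  # hypothetical 1T-parameter model

for tokens in (1_000, 10_000, 100_000):
    print(f"{tokens:>7} tokens -> {rollout_flops(PARAMS, tokens):.1e} FLOPs")
```

Each extra order of magnitude in task length buys an extra order of magnitude in per-rollout cost, which is exactly why a slowdown was expected.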
The first reason progress hasn't slowed is that newer models squeeze far more value out of every FLOP than their predecessors. AI labs aren't scaling up their GPU clusters by orders of magnitude as quickly as model capabilities grow, partly because physical datacenter expansion puts hard limits on how fast new hardware can be deployed. Instead, most efficiency gains come from eliminating massive, avoidable waste in training code. Goedecke cites a well-documented example from the initial GPT-4 training run, where the team summed many small values at FP16 floating-point precision. This is a classic pitfall in numerical computing: FP16 carries only about three decimal digits of precision, so once a running sum grows large enough, each additional small value falls below the rounding threshold and is silently dropped. Fixing bugs like this can improve training efficiency per FLOP by 100x or more, easily offsetting any inherent efficiency loss from training more powerful models. Training code for frontier models is so complex that even obvious-in-hindsight mistakes can waste massive amounts of compute, so incremental engineering fixes add up to huge gains over time.
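To make the failure mode concrete, here is a minimal NumPy sketch of FP16 accumulation going wrong (illustrative only; the actual GPT-4 training code is not public). The running sum climbs to 1024 and then stops growing: at that magnitude the gap between adjacent representable FP16 values is 1.0, so each added 0.5 rounds away to nothing.

```python
import numpy as np

# 100,000 small values whose true sum is 50,000.
values = np.full(100_000, 0.5, dtype=np.float16)

# Naive FP16 running sum: once the total reaches 1024, the FP16
# grid spacing there is 1.0, and 1024 + 0.5 rounds back to 1024
# (round-half-to-even), so the sum silently stalls.
fp16_sum = np.float16(0.0)
for v in values:
    fp16_sum = np.float16(fp16_sum + v)

# Accumulating in FP32 gives the correct answer.
fp32_sum = values.astype(np.float32).sum()

print(f"FP16 running sum: {fp16_sum}")  # 1024.0
print(f"FP32 reference:   {fp32_sum}")  # 50000.0
```

The standard fix is the same in training code as here: keep the accumulator in a wider format (FP32) even when the individual values are stored in FP16.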
A second factor is that humans are reliably bad at judging AI intelligence, especially as models approach human-level performance. People measure intelligence on an uneven scale. It is easy to tell when a model is less smart than you, because you can directly observe it making mistakes you wouldn't make. It is far harder to tell when a model is smarter than you, because now the unnoticed mistakes are your own. In that regime, you have to rely on long-term outcome data, or on moments where the model corrects you after initial confusion, to realize it has superior reasoning. The jump from GPT-3 to GPT-4 felt massive because GPT-3 was less capable than almost all humans, while GPT-4 matched human performance in many common domains. Now that frontier models operate in ambiguous areas where even domain experts disagree, it is nearly impossible to pin down the actual rate of raw intelligence growth. It is possible that underlying intelligence gains have slowed, but we do not yet have good tools to measure that, since our benchmarks lag behind model capabilities.
A third, often overlooked factor is that intelligence is not the only trait that determines model capabilities. Many other characteristics play a role, including working memory size, familiarity with tool harnesses, ability to attend to long context windows, and even personality traits that affect willingness to persist through long tasks. Take the October 2024 shift where OpenAI and Anthropic models suddenly became far more agentic, able to handle complex end-to-end tasks like multi-file code refactoring or research synthesis without constant human intervention. That shift may reflect higher raw intelligence, but it could also come from any of the non-intelligence traits listed above, many of which can be improved with targeted tweaks rather than massive compute spending. Goedecke points to Apple's widely criticized 2025 paper 'The Illusion of Thinking' as a clear example of confusing persistence with intelligence. The paper tested models on Tower of Hanoi puzzles with increasing numbers of disks, judging reasoning ability by whether the models could enumerate a correct move sequence step by step. Examining the model outputs shows that almost all failures came from models recognizing the task would take hundreds of steps and declining to attempt it, not from any inability to handle the puzzle logic. The same models could write code to solve the Tower of Hanoi instantly, or correctly complete smaller subsets of the steps with perfect accuracy. The gap was persistence, not intelligence: a model that gives up when a task looks too long is not less smart; it simply lacks training to work through long multi-step sequences without self-censoring.
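To see how quickly the puzzle outgrows step-by-step enumeration, note that solving n disks takes 2^n − 1 moves, so the move list explodes exponentially while the underlying logic stays trivial. Below is a generic recursive solver in Python, an illustrative sketch rather than the code used in the paper's evaluations.

```python
# Tower of Hanoi: move n disks from `source` to `target` via `spare`.
# The move count is 2**n - 1, so the sequence length explodes
# exponentially even though the recursion is three lines long.
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the full move list for n disks as (from, to) pairs."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, source, spare, target, moves)  # clear the top n-1 disks
        moves.append((source, target))              # move the largest disk
        hanoi(n - 1, spare, target, source, moves)  # restack on top of it
    return moves

print(hanoi(3))  # 7 moves: [('A', 'C'), ('A', 'B'), ('C', 'B'), ...]
for n in (3, 10, 15, 20):
    print(f"{n} disks -> {2**n - 1} moves")
```

A 20-disk instance requires over a million moves: trivial for a program, but exactly the kind of sequence length at which the models in the paper declined to keep enumerating.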
For developers building with AI, especially those training custom models or deploying agentic systems, these points cut through both hype and pessimism. The common narrative that AI progress is purely a function of compute scaling is oversimplified, which means small teams with efficient, bug-free training code can still make meaningful advances without access to massive GPU clusters. The distinction between intelligence and other traits like persistence also explains why some models perform well on static benchmarks but fail at real-world agentic tasks, and how tweaking training data or system prompts can unlock capabilities that raw compute scaling cannot. Developers working on coding agents or task automation tools may find that improving model persistence or tool familiarity delivers more practical gains than pushing for slightly higher raw intelligence.
Goedecke's full post is available at seangoedecke.com and has already sparked active discussion on Hacker News and AI developer forums. Commenters are debating the relative roles of training efficiency and new architectural advances, with many sharing their own experiences of compute wasted on avoidable training bugs. Some note that the difficulty of measuring intelligence as models approach human-level performance makes it hard to separate real technical progress from marketing claims by large AI labs. Others point to recent agentic coding tools like Claude Code and GitHub Copilot's agentic features as evidence that non-intelligence traits are driving the latest wave of useful AI applications for developers. Dwarkesh Patel has not yet announced the winners of his question call, but the public discussion around these topics has already surfaced detailed, engineering-focused perspectives that are rare in mainstream AI coverage.
The core takeaway is that broad theories about AI progress often fail to account for the messy reality of engineering work. Progress is driven by one-off bug fixes that deliver 100x compute savings, clever ideas that make models 100x more useful, and spiky capability gains that improve performance in one area while leaving others unchanged. Even inside large AI labs, teams do not have a clear accounting of how many FLOPs are wasted on bugs versus spent on useful training. As models grow more capable, measuring true progress will only get harder, but for now, the gap between expected and observed slowdowns comes down to the same factors that drive all software engineering: efficiency, clear metrics, and attention to the details that actually affect performance.
