The Turing Award winner behind reinforcement learning argues that systems trained by supervised learning can be novel or good, but never both at once. His fix for real scientific discovery comes down to three steps that large language models skip entirely.
Richard Sutton, the reinforcement learning pioneer who shared the 2024 Turing Award with Andrew Barto, has spent decades arguing that AI progress comes from computation and search rather than from hand-built human knowledge. In a recorded talk for the SAIR Foundation workshop on Science for AI, posted to X on May 31, he turned that lens on the technology currently absorbing most of the industry's capital: generative AI. His conclusion is blunt. Models trained by supervised learning, he says, are structurally incapable of making novel discoveries.
{{IMAGE:1}}
The joke that frames the argument
Sutton opens with an old academic jab. A reviewer evaluates a researcher's work and writes back: "This work is both novel and good. Unfortunately, the parts that are good are not novel, and the parts that are novel are not good." That line, he argues, describes a large share of what we now call generative AI, the category that includes large language models, image and video generators, and the newer world-model systems.
The mechanism is straightforward. These systems ingest enormous numbers of examples and produce a model that behaves like those examples. They generate text like people, images like artists, video like the material already on the internet. When the output is good, the goodness is inherited from the source material. When the output is novel, it has stepped beyond that material into territory we usually call hallucination. The randomness baked into each generation step means a model can take a different path every time, but that path is either grounded in the training data, and therefore good, or it is a shot in the dark, and therefore novel. Sutton's claim is that it cannot be both at the same moment.
He is careful not to dismiss the technology. Mimicry is genuinely useful when it is faster, cheaper, smaller, more customizable, or easier to copy than the thing being imitated. For summarizing a document or pulling an answer off the web, novelty is the last thing anyone wants. The problem only becomes fatal in the specific domains the SAIR workshop cares about: science and mathematics, where the entire point is to produce something both new and correct.
What separates the systems that actually discovered something
Sutton's interesting move is to name the systems that have crossed that line. AlphaGo's move 37 against Lee Sedol. AlphaZero's chess style, which looked alien to centuries of human theory. Sony's GT Sophy outracing human drivers in simulation. AlphaFold and AlphaProof. He even cites Claude Code as having produced real advances in programming. The common thread, he argues, is not bigger pattern recognition. It is a capability that supervised learning cannot supply on its own.
He calls it Discovery, and admits the name is imperfect. The idea itself is ancient and almost embarrassingly simple: try many things, see which ones work, keep the winners. Evolution by natural selection runs on it. The scientific method runs on it. Ordinary learning runs on it. Psychology calls it operant conditioning; machine learning calls it reinforcement learning, the field whose standard textbook Sutton co-authored. Planning and combinatorial search express the same generate-and-test loop.
{{IMAGE:2}}
The formal version has three steps: variation, evaluation, and selective retention. Sutton credits earlier thinkers including Donald Campbell, Daniel Dennett, and Gary Cziko for the underlying insight. His contribution is to map it directly onto modern AI and show exactly where generative systems fall short.
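The three steps can be sketched as a toy generate-and-test loop. This is only an illustration under stated assumptions, not code from Sutton's talk: the function names (`discover`, `vary`, `evaluate`), the numeric target, and the Gaussian perturbation are all hypothetical choices made for the example.

```python
import random

def discover(evaluate, vary, seed_candidate, steps=2000, rng=None):
    """Generate-and-test loop: variation, evaluation, selective retention."""
    rng = rng or random.Random(0)
    best = seed_candidate
    best_score = evaluate(best)
    for _ in range(steps):
        candidate = vary(best, rng)   # variation: a partly blind proposal
        score = evaluate(candidate)   # evaluation: judged against an explicit goal
        if score > best_score:        # selective retention: keep only the winners
            best, best_score = candidate, score
    return best, best_score

# Hypothetical toy goal: find the number closest to TARGET.
TARGET = 3.0

def evaluate(x):
    # Higher is better; the maximum (0.0) is reached at x == TARGET.
    return -(x - TARGET) ** 2

def vary(x, rng):
    # Informed by the current best, but with genuine randomness.
    return x + rng.gauss(0.0, 0.5)

best, score = discover(evaluate, vary, seed_candidate=0.0)
print(round(best, 2), round(score, 4))
```

Removing the `if score > best_score` check leaves a loop that still produces new candidates on every step but never recognizes or keeps a good one.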
The missing step is evaluation
Here is the crux. Generative models do have variation. The stochastic sampling at each step gives them plenty of candidate trajectories. What they lack at runtime is evaluation. The generator's weights are frozen once training ends, so when it produces something novel, nothing inside the system judges whether that novelty is any good. Without evaluation there can be no selective retention, and without retention there is no discovery. The new idea, as Sutton puts it, flickers into existence and then flickers away unrecognized.
Evaluation has to come from somewhere. Sometimes it comes from a person, as when you generate a dozen images and pick your favorite, in which case the human and the AI together complete the discovery. Sometimes it comes from a hard objective: a move that leads to checkmate, a step that completes a proof, an action that earns reward, a theory that fits the data. Only that second case, Sutton notes, yields full autonomy, because only then does the system contain its own judge.
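To make the contrast concrete, here is a minimal sketch of a frozen sampler with and without a hard objective attached. Everything in it is an assumption made for illustration: the "generator" is just a random number source standing in for a trained model, and the goal (finding a factor pair of 391) is a hypothetical stand-in for a checkable objective such as a completed proof.

```python
import random

N = 391  # hypothetical goal: find a pair of integers whose product is N

def frozen_generator(rng):
    """Stand-in for a frozen sampler: it proposes pairs but never judges them."""
    return rng.randint(2, 30), rng.randint(2, 30)

def hard_objective(pair):
    """A checkable goal: does the proposed pair multiply to N?"""
    a, b = pair
    return a * b == N

def sample_without_judge(n, rng):
    # Every proposal is emitted as-is; nothing marks a correct one as special.
    return [frozen_generator(rng) for _ in range(n)]

def sample_with_judge(n, rng):
    # Same generator, but an evaluator retains only proposals that meet the goal.
    proposals = (frozen_generator(rng) for _ in range(n))
    return [p for p in proposals if hard_objective(p)]

kept = sample_with_judge(20000, random.Random(0))
print(sorted(set(kept)))
```

Both functions draw from the same generator. Only the second contains its own judge, which is the property Sutton ties to full autonomy.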
He also defends the role of blindness in the variation step. A good scientist does not pick experiments at random, so the variation is partly informed. But it cannot be fully determined either. Some genuine uncertainty about where the answer lies is what makes the result a discovery rather than a lookup.
Why backpropagation alone doesn't get there
Sutton anticipates the obvious objection. Deep networks are trained by backpropagation, and gradient descent is deterministic, so where is the variation? His answer is that it hides in the random initialization of the weights, a detail often treated as a footnote but which he says is a real and necessary form of variation. The catch is that this variation happens exactly once, at startup. As training proceeds the network can lose its capacity to keep learning.
This is where he points to his own group's work. A 2024 Nature paper introduced "continual backpropagation," which makes one modest change: every so often a rarely used neuron gets re-initialized to small random weights. That keeps variation flowing and preserves the plasticity that standard deep learning quietly burns through. It is a small mechanical fix aimed at a structural gap.
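The re-initialization step can be sketched in a few lines. This is a simplified illustration, not the published algorithm: the per-unit `utility` score, the `maturity` age threshold, the `init_scale` range, and the choice to zero the outgoing weight are all assumptions of this sketch, and the network is reduced to plain Python lists with a single output.

```python
import random

def reinit_least_used(w_in, w_out, utility, age, rng, maturity=100, init_scale=0.1):
    """Reset the lowest-utility mature hidden unit; return its index, or None.

    w_in[j]    : list of incoming weights of hidden unit j
    w_out[j]   : outgoing weight of hidden unit j (single output, for brevity)
    utility[j] : running estimate of how much unit j contributes (assumed given)
    age[j]     : steps since unit j was last (re)initialized
    """
    # Only units old enough to have had a fair chance are eligible for reset.
    mature = [j for j in range(len(w_in)) if age[j] >= maturity]
    if not mature:
        return None
    j = min(mature, key=lambda k: utility[k])
    # Fresh small random incoming weights inject new variation.
    w_in[j] = [rng.uniform(-init_scale, init_scale) for _ in w_in[j]]
    # Assumption of this sketch: zero the outgoing weight so the reset
    # does not immediately change the network's output.
    w_out[j] = 0.0
    utility[j] = 0.0
    age[j] = 0
    return j

rng = random.Random(0)
w_in = [[0.5, -0.3], [0.9, 0.8], [0.01, 0.02]]
w_out = [1.2, -0.7, 0.001]
utility = [0.6, 0.9, 0.0005]   # unit 2 contributes almost nothing
age = [500, 500, 500]
reset = reinit_least_used(w_in, w_out, utility, age, rng)
print(reset, w_in[reset], w_out[reset])
```

In a training loop, a function like this would be called every so often between ordinary gradient updates, so that variation keeps arriving after the initial random initialization.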
The call to arms, and what it means for builders
Sutton ends with a pitch rather than a verdict. If the industry wants AI scientists that genuinely create, it needs to give those systems explicit goals they can optimize against, so the systems can vary, evaluate, retain, and discover on their own. "Let's fully automate creativity and discovery," he says.
For anyone funding or building in this space, the argument cuts against the dominant bet. Enormous sums are flowing toward scaling supervised pre-training, on the assumption that a large enough mimic eventually becomes a creator. Sutton is saying that is a category error. Scale makes the mimicry better, faster, and cheaper, all real and valuable things, but it does not install the evaluation loop that discovery requires. The companies that have produced authentically new results, from DeepMind's game and protein systems to verified math proofs, all bolted on some form of search and reward. The ones selling pure next-token prediction as a path to invention are, by this reading, selling a very good copy machine.
Whether the field agrees is another matter. Sutton flagged the view as "possibly controversial," and the talk drew hundreds of replies and more than 600,000 views within days. The disagreement tends to land on whether techniques like reinforcement learning from human feedback and the newer reasoning models already smuggle evaluation back in, blurring the clean line he draws. His framework at least gives the debate a precise question to argue over: not whether a model is large or fluent, but whether anything in the loop is judging its novel outputs and keeping the ones that work. The full text and video are worth taking in directly, in his own words.