Why We Stopped Using the Mathematics That Works
#AI

Startups Reporter

A deep dive into why decision-theoretic AI methods fell out of favor despite their superiority, exploring path dependence, disciplinary silos, and the seductive convenience of deep learning's approach.

The question was simple enough: if the mathematics for building genuinely intelligent agents has existed since the 1960s, why did we stop using it? The answer turns out to be less about technical merit and more about the sociology of knowledge, the economics of convenience, and the peculiar way that entire fields can collectively decide to forget what they already knew.


The ImageNet Moment That Changed Everything

In 2012, Alex Krizhevsky submitted a deep convolutional neural network to the ImageNet Large Scale Visual Recognition Challenge. It won by 9.8 percentage points over the nearest competitor. In a field accustomed to incremental improvements measured in fractions of a percent, this wasn't just a win—it was a detonation.

What followed wasn't a reasoned evaluation of competing paradigms. It was a gold rush. Google hired Geoffrey Hinton. Facebook hired Yann LeCun. Baidu hired Andrew Ng. The Canadian Institute for Advanced Research, which had quietly funded neural network research through two decades of indifference, suddenly found its bet paying off spectacularly.

Venture capital followed. PhD students followed the venture capital. Conference papers followed the PhD students. Within five years, deep learning had consumed machine learning almost entirely.

Not because the methods it displaced had stopped working, but because the money, the talent, and the prestige had moved elsewhere. The researchers who understood decision theory, Bayesian inference, and operations research didn't lose their arguments. They lost their audience.

The Geography Problem

There's a structural issue that predates the deep learning boom: the methods that constitute good decision-making under uncertainty are scattered across academic departments that barely speak to each other. Decision theory sits in philosophy and economics. Bayesian statistics sits in statistics departments that were hostile to it for most of the twentieth century. Operations research sits in business schools and industrial engineering. Reinforcement learning, the one branch that maintained contact with mainstream AI, drifted toward deep learning and left its decision-theoretic roots behind.

The result is a kind of intellectual diaspora. The ideas exist; they're published; they're mathematically mature. But no single department teaches them as a coherent toolkit, and no single conference brings their practitioners together. NeurIPS received over 28,000 submissions in 2024. The flagship OR conference gets a few thousand. The researchers haven't disappeared, but they're outnumbered perhaps fifty to one by people who've never encountered their work.

Methods that aren't taught don't get used. This is so obvious it barely needs stating, and yet it explains an enormous amount.

The Specification Problem

There is, however, a more sympathetic explanation for the shift. Decision theory asks you to do something genuinely difficult: write down what you want. A Bayesian decision-theoretic agent needs explicit utility functions, cost models, prior distributions, and a formal description of the action space. Every assumption must be stated. Every trade-off must be quantified.

This is intellectually honest and practically gruelling. Getting the utility function wrong doesn't just give you a bad answer; it gives you a confidently optimal answer to the wrong question.
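To make the burden concrete, here is a minimal sketch of what that specification looks like in practice. It is a hypothetical example, not anything from the article: a made-up prior over world states, a made-up utility table, and the rule that picks the action with the highest expected utility. The names and numbers are illustrative assumptions.

```python
# Hypothetical illustration: the explicit specification a decision-theoretic
# agent demands. Every belief and every trade-off must be written down.

PRIOR = {"machine_ok": 0.9, "machine_faulty": 0.1}   # belief about the world
ACTIONS = ["ship_part", "inspect_part"]               # formal action space

# Utility of each (action, state) pair -- the genuinely hard part to get right.
UTILITY = {
    ("ship_part",    "machine_ok"):     10.0,
    ("ship_part",    "machine_faulty"): -100.0,  # cost of shipping a defect
    ("inspect_part", "machine_ok"):     -2.0,    # inspection cost, wasted
    ("inspect_part", "machine_faulty"): -2.0,    # inspection cost, defect caught
}

def expected_utility(action: str) -> float:
    """Average utility of an action, weighted by belief in each state."""
    return sum(p * UTILITY[(action, state)] for state, p in PRIOR.items())

best = max(ACTIONS, key=expected_utility)
print({a: round(expected_utility(a), 2) for a in ACTIONS}, "->", best)
```

Get any entry of that utility table wrong and the optimiser will cheerfully maximise the wrong thing, which is exactly the failure mode described above.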

Deep learning asks for none of this. Collect data, define a loss function (usually cross-entropy or mean squared error, chosen almost by convention), and let the model find patterns. You don't need to specify what you believe about the world; the network will learn its own representations. You don't need to enumerate your objectives; gradient descent will optimise whatever you point it at.
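For contrast, here is roughly the entire "specification" a deep-learning pipeline asks of you. This is a minimal, illustrative PyTorch sketch with made-up shapes and random data, not a description of any particular system: pick a loss, pick an optimiser, fit.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()                    # chosen largely by convention
optimizer = torch.optim.Adam(model.parameters())   # default hyperparameters

x = torch.randn(32, 64)             # a batch of inputs (toy data)
y = torch.randint(0, 10, (32,))     # a batch of labels (toy data)

for _ in range(100):                # no priors, no utilities, no action space
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```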

This is enormously convenient, and convenience is underrated as an explanatory variable in the history of ideas. The frequentist victory over Bayesian statistics in the early twentieth century followed a similar pattern: not because frequentist methods were better, but because they were computationally tractable with the tools available. When MCMC methods made Bayesian computation feasible in the 1990s, Bayesian statistics experienced a genuine revival. The ideas hadn't changed. The convenience had.

Deep learning's convenience advantage is the same phenomenon at larger scale. Why specify a prior when you can train on a million examples? Why model uncertainty when you can just make the network bigger? The answers to these questions are good answers, but they require you to care about things the market doesn't always reward.

Good Enough

And here we arrive at the commercial reality. For most applications that generate revenue, a system that gets the answer roughly right is more valuable than a system that gets it optimally right but takes three times longer to build. The gap between "approximately correct" and "decision-theoretically optimal" is real, but it lives in the tails: the edge cases, the adversarial inputs, the distributional shifts that happen at 3am on a Sunday.

By the time those matter, the product has shipped and the team has moved on. This is the VHS-versus-Betamax dynamic, or TCP/IP versus the OSI model, or QWERTY versus every ergonomic alternative proposed since 1936. The technically superior solution loses to the solution that's easier to deploy, easier to hire for, and good enough for the use cases that pay the bills.

Decision theory is Betamax: genuinely better in measurable ways that most buyers don't measure. The market rarely punishes you for confusing correlation with causation. At least, not immediately. The punishment comes later, in the form of systems that can't adapt when conditions change, recommendations that optimise engagement metrics while destroying the thing they're meant to recommend, and autonomous agents that query every tool for every question because they have no concept of whether a query is worth its cost.

But these are slow-moving consequences, and quarterly earnings reports are fast-moving incentives.

Fashion in Methodology

Bayesian methods have been going in and out of fashion since Thomas Bayes himself, who derived his theorem in the 1740s and then, apparently finding the result unsatisfying, never published it. Richard Price published it posthumously. Laplace rediscovered and extended it. Fisher attacked it. Jeffreys defended it. The entire twentieth century was an argument about whether you were allowed to have prior beliefs, conducted with a passion that suggests the participants understood the stakes were more theological than statistical.

MCMC brought Bayesian methods back in the 1990s. Deep learning pushed them out again in the 2010s. Now, in the mid-2020s, the limitations of pure pattern-matching are becoming visible enough that probabilistic methods are creeping back: Bayesian neural networks, conformal prediction, probabilistic programming languages. The wheel turns.

I find this cyclical pattern more explanatory than any single technical argument. Ideas in methodology don't win or lose on merit alone. They win when there's a community to teach them, hardware to run them, problems that visibly need them, and advocates charismatic enough to attract funding. Deep learning had all four simultaneously. Decision theory had the mathematics but lacked the rest.

What Might Change

The methods never stopped working. The Bayesian agent I built uses nothing more exotic than Beta distributions and expected-value-of-information calculations. It outscored a LangChain agent by 120 points not because it was smarter, but because it could answer a question that LangChain cannot even pose: is the next tool query worth its cost?
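The author's agent isn't reproduced here, but the kind of check described, a Beta posterior over a tool's reliability feeding a value-versus-cost comparison, is simple enough to sketch. The function name, payoff values, and query cost below are illustrative assumptions, and the calculation is a deliberately simplified version of a full value-of-information analysis.

```python
# Illustrative sketch (not the author's implementation): decide whether the
# next tool query is worth its cost, given a Beta posterior over the tool's
# track record of returning useful answers.

def worth_querying(successes: int, failures: int,
                   value_if_right: float, value_if_wrong: float,
                   query_cost: float) -> bool:
    """Query only if the expected payoff of the call exceeds its cost."""
    # Posterior mean of Beta(successes + 1, failures + 1), i.e. a uniform prior
    # updated with the tool's observed hits and misses.
    p_useful = (successes + 1) / (successes + failures + 2)
    expected_payoff = p_useful * value_if_right + (1 - p_useful) * value_if_wrong
    return expected_payoff - query_cost > 0.0

# A tool that has helped 8 times and misled twice, where a useful answer is
# worth 1.0, a misleading one costs 0.5, and each call costs 0.3:
print(worth_querying(8, 2, value_if_right=1.0, value_if_wrong=-0.5, query_cost=0.3))
```

The point is not the arithmetic, which is elementary, but that the question "is this call worth making?" has to be representable in the system at all before it can be answered.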

Whether this matters beyond a toy benchmark remains to be seen. But the question the commenter asked, why we stopped using these methods, has a clear answer: not because they failed, but because something shinier came along, and the institutions that transmit knowledge reorganised themselves around the new shiny thing.

It's happened before. It will happen again. The interesting question is whether, this time, we'll remember what we already knew.
