MIT Researchers Show Why Ranking Three Options Beats Comparing Two for Modeling Human Preferences

A team at MIT proved that pairwise comparisons can never reveal correlations between preferences, but asking people to rank three items unlocks them. The result reshapes how systems from Netflix recommendations to LLM alignment should collect their training data.

A method psychologists have leaned on since 1927 just got a mathematical correction that matters for anyone building recommendation systems or training large language models. Researchers at MIT have shown that the standard way of collecting preference data, asking people to choose between two options at a time, throws away information that turns out to be recoverable through a small change: ask people to rank three options instead of two.

The work, presented in April at the International Conference on Learning Representations in Rio de Janeiro, comes from Yeshwanth Cherapanamjeri (a former MIT postdoc now at Nanyang Technological University), Gabriele Farina of MIT's Department of Electrical Engineering and Computer Science, Constantinos Daskalakis of MIT's Computer Science and Artificial Intelligence Laboratory, and PhD student Sobhan Mohammadpour. Their paper, "Learning correlated reward models: Statistical barriers and opportunities," targets a class of tools called random utility models.

What random utility models actually do

Random utility models, or RUMs, trace back to L. L. Thurstone's 1927 paper "A law of comparative judgment." The premise is that when someone picks one option from several, they are selecting the one with the highest internal value, or utility, to them, even though they could never write down a precise number for that value. The models are random because people differ from one another, and a single person's preferences shift over time. Someone who reaches for coffee in the morning and tea after dinner might reverse that on any given day.
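To make the premise concrete, here is a minimal sketch of a random utility model, assuming Gaussian noise around a fixed average utility for each option. The function names (`choose`, `choice_shares`) and the utility values for coffee and tea are hypothetical, chosen only for illustration; they do not come from the paper.

```python
import random

def choose(mean_utility, rng, noise_sd=1.0):
    """Draw one noisy utility per option and return the option with the highest draw."""
    draws = {option: mean + rng.gauss(0.0, noise_sd)
             for option, mean in mean_utility.items()}
    return max(draws, key=draws.get)

def choice_shares(mean_utility, trials=20_000, seed=0):
    """Estimate how often each option is chosen across many simulated decisions."""
    rng = random.Random(seed)
    counts = {option: 0 for option in mean_utility}
    for _ in range(trials):
        counts[choose(mean_utility, rng)] += 1
    return {option: n / trials for option, n in counts.items()}

# Assumed averages: coffee is valued slightly more than tea on a typical day.
shares = choice_shares({"coffee": 0.5, "tea": 0.0})
print(shares)
```

Coffee wins most simulated decisions but not all of them, which is the randomness the model is built to capture. An analyst never observes the utilities themselves, only the choices.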

"These models are inherently random," Farina explains, "because people are different. Everyone has their own preferences, and even those preferences can vary from time to time."

The practical reach of RUMs goes far beyond beverage choices. Transportation planners use them to predict how commuters reroute when a highway closes for construction, including which alternate roads and modes of transport people will switch to. City governments use them to estimate how to allocate a budget windfall to maximize public benefit. And since the late 1990s, they have been a backbone of the internet economy, sitting underneath the ranking and recommendation engines at companies like Netflix, Amazon, and Google.

The flaw baked into pairwise data

The trouble starts with how these models get estimated in practice. The dominant approach feeds the model pairwise comparisons. Between movie A and movie B, which do you prefer? Between two competing products, which do you buy? This format is popular for a sound cognitive reason.

"Assigning a precise numerical score, such as 4.37, to the benefit you get from a single item is very hard," Daskalakis notes. "Whereas comparing two things, and deciding which one you like better, is cognitively much easier to do."

The catch is what pairwise data leaves on the table. The standard application of RUMs assumes the utilities of A and B are independent. In reality they often are not. A voter who supports gun control is statistically more likely to also support government-funded child care. A viewer who loves independent films probably also leans toward foreign cinema and away from Hollywood action blockbusters. These are correlations between preferences, and they carry real predictive weight.

"With this way of assessing people's preferences, looking at just two things at a time, it is impossible to find correlations between the numerous choices," Daskalakis says. "If a digital platform has a blind eye to the existence of such correlations, it will not be able to estimate preferences very accurately. And if Netflix regularly shows you an assortment of movies you don't care about, you might sign off and cancel your subscription."

The power of three

The MIT team did two things. First, they proved a hard limit: no amount of two-way comparison data can ever recover correlation structure between choices. This is not a question of needing more data or better algorithms. The information simply is not present in pairwise comparisons, and the paper establishes that as a statistical barrier.

Second, they showed where the information does live. When large numbers of people rank three alternatives in order of preference, the correlations become recoverable. The same can be achieved with a mix of best-of-three and best-of-two responses, in which each person names only a top pick from three or two options. The jump from two to three is the threshold that matters.
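A toy simulation can show the flavor of both claims. The sketch below is not the paper's proof or estimator. It assumes three options A, B, and C with zero-mean, unit-variance Gaussian utilities, where only A and B are correlated (the parameter `rho`, like every name here, is hypothetical). With equal average utilities, every head-to-head matchup is a coin flip whatever the correlation, but the share of people who put C first out of three shifts with it.

```python
import math
import random

def sample_utilities(rho, rng):
    """Draw utilities for A, B, C: each standard normal, with corr(A, B) = rho."""
    z1, z2, z3 = (rng.gauss(0.0, 1.0) for _ in range(3))
    a = z1
    b = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
    c = z3  # C is independent of both A and B
    return {"A": a, "B": b, "C": c}

def summarize(rho, trials=100_000, seed=0):
    """Return (pairwise win rates, best-of-three shares) for one correlation level."""
    rng = random.Random(seed)
    pair_wins = {("A", "B"): 0, ("A", "C"): 0, ("B", "C"): 0}
    top_counts = {"A": 0, "B": 0, "C": 0}
    for _ in range(trials):
        u = sample_utilities(rho, rng)
        for left, right in pair_wins:
            if u[left] > u[right]:
                pair_wins[(left, right)] += 1
        top_counts[max(u, key=u.get)] += 1
    pair_rates = {pair: n / trials for pair, n in pair_wins.items()}
    top_shares = {item: n / trials for item, n in top_counts.items()}
    return pair_rates, top_shares

independent_pairs, independent_tops = summarize(rho=0.0)
correlated_pairs, correlated_tops = summarize(rho=0.9)
print("pairwise, independent:", independent_pairs)
print("pairwise, correlated: ", correlated_pairs)
print("best of three, independent:", independent_tops)
print("best of three, correlated: ", correlated_tops)
```

In both runs the pairwise win rates sit near one half, so pairwise data cannot tell the two populations apart. The best-of-three shares do separate them: C comes first about a third of the time when utilities are independent and about 45 percent of the time when A and B are strongly correlated, because A and B then split the same fans.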

"You would get a bunch of people to rank three items," Mohammadpour explains. "You could then utilize the method we developed for merging those individual results into one big model that can provide us with the big picture."

Much of the contribution is computational rather than purely theoretical. Proving correlations are recoverable in principle is one thing; extracting them efficiently is another. Farina's group focused on algorithms that pull preference structure out of ranked triples and on quantifying how much data the job requires, which is equivalent to asking how many experiments need to be run. The encouraging result is that the number of required experiments does not blow up exponentially as the catalog grows. For a recommendation system with thousands or millions of items, exponential scaling would have made the approach useless. Polynomial scaling makes it deployable.

Why this lands on AI alignment

The timing connects directly to how large language models are trained. During alignment, human annotators are routinely asked to rank candidate model outputs, and the model learns from those rankings what tone, style, and content people prefer. That ranking signal feeds a reward model, which is itself a utility model in the RUM family.

If the same correlation blindness that limits Netflix recommendations also limits reward models trained on pairwise comparisons, then the standard reinforcement learning from human feedback pipeline may be leaving recoverable preference structure unused. The MIT result suggests a concrete change: have annotators rank three candidates, or pick the best of three, rather than give only A-versus-B judgments, and the resulting reward model can capture correlations that pairwise data cannot express.
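For a sense of what that change means at the data level, here is a small sketch of how one annotator's ranking of three candidates could be stored. The record layout and the name `labels_from_ranking` are assumptions made for illustration, not a format taken from the paper or from any particular alignment pipeline.

```python
from itertools import combinations

def labels_from_ranking(ranking):
    """Split one ranked triple (best first) into a best-of-three label
    and the three pairwise labels that the ranking implies."""
    if len(ranking) != 3 or len(set(ranking)) != 3:
        raise ValueError("expected three distinct items, best first")
    best_of_three = {"shown": tuple(sorted(ranking)), "chosen": ranking[0]}
    # combinations() preserves input order, so the earlier item is the winner.
    pairwise = [{"winner": w, "loser": l} for w, l in combinations(ranking, 2)]
    return best_of_three, pairwise

# Hypothetical annotation: the annotator ranked response B first, then A, then C.
best, pairs = labels_from_ranking(["resp_b", "resp_a", "resp_c"])
print(best)
print(pairs)
```

A single ranked triple yields everything a pairwise pipeline would have collected for those three candidates, plus the three-way choice itself. Storing the triple as one record, rather than only as separate pairwise rows, keeps that three-way signal available.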

"Just as RUMs have been critical to the internet economy since the late 1990s, they are, and will remain to be, critical to the alignment of AI models going forward," Daskalakis says. He frames the broader motivation in terms of scale. "You cannot possibly ask people to communicate all their personal preferences for all possible scenarios. So what you can do instead is build a model that predicts what people think about the different possible outcomes. And you have to keep improving and updating your model in an iterative process until, hopefully, you can make good predictions."

Practical limits and what comes next

The finding is not a free upgrade in every setting. Ranking three items asks more of each respondent than a single binary choice, so the per-response cognitive cost rises even as the information yield improves. Whether that trade favors triples depends on the domain, the annotators, and how much the missing correlations actually matter for the downstream task. The paper gives a roadmap for data collection, not a guarantee that every system needs to switch.

Emma Frejinger, a computer scientist at the University of Montreal, sees the practical payoff clearly. "This paper provides a crucial breakthrough. It mathematically proves why traditional data collection fails and demonstrates that simply asking users for their best-of-three choices unlocks the ability to accurately train these powerful models. This finding provides a highly practical roadmap for collecting better data to drive more accurate optimizations."

For a field that has refined RUMs for nearly a century, the lesson is that the format of the question, not just the volume of answers, sets a ceiling on what a model can learn. Teams building recommendation engines, transportation forecasts, or LLM reward models now have a precise reason to rethink how they pose their comparisons. The choice between collecting pairs and collecting triples is no longer a matter of convenience. It determines whether correlation structure is even available to be learned.

The paper and related work are available through MIT's Laboratory for Information and Decision Systems and CSAIL. The authors' broader research on game theory and learning can be followed through the work of Gabriele Farina and Constantinos Daskalakis.