MIT researchers develop RLCR technique that trains AI models to provide calibrated confidence estimates alongside their answers, reducing calibration error by up to 90% while maintaining accuracy. This addresses a fundamental flaw in current AI training that leads to overconfidence and potential hallucinations.
Confidence is persuasive. In artificial intelligence systems, it is often misleading. Today's most capable reasoning models share a trait with the loudest voice in the room: They deliver every answer with the same unshakable certainty, whether they're right or guessing. This overconfidence isn't just a technical quirk—it has real consequences when these systems are deployed in high-stakes environments.
Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have now traced that overconfidence to a specific flaw in how these models are trained, and developed a method that fixes it without giving up any accuracy. The technique, called RLCR (Reinforcement Learning with Calibration Rewards), trains language models to produce calibrated confidence estimates alongside their answers. In addition to coming up with an answer, the model thinks about its uncertainty in that answer, and outputs a confidence score.
In experiments across multiple benchmarks, RLCR reduced calibration error by up to 90 percent while maintaining or improving accuracy, both on the tasks the model was trained on and on entirely new ones it had never seen. The work will be presented at the International Conference on Learning Representations later this month.
The Root Cause of AI Overconfidence
The problem traces to a surprisingly simple source. The reinforcement learning (RL) methods behind recent breakthroughs in AI reasoning, including the training approach used in systems like OpenAI's o1, reward models for getting the right answer, and penalize them for getting it wrong. Nothing in between.
A model that arrives at the correct answer through careful reasoning receives the same reward as one that guesses correctly by chance. Over time, this trains models to confidently answer every question they are asked, whether they have strong evidence or are effectively flipping a coin.
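The incentive problem can be made concrete with a toy expected-reward calculation (an illustrative sketch, not the researchers' code): under a correct/incorrect reward, answering always has non-negative expected value while abstaining earns nothing, so a policy trained on that signal learns to guess even when it is deeply unsure.

```python
# Toy illustration: under a binary reward (+1 correct, 0 otherwise),
# guessing weakly dominates abstaining, no matter how unsure the model is.

def expected_binary_reward(p_correct: float, abstain: bool) -> float:
    """Expected reward under a correct/incorrect reward scheme."""
    if abstain:
        return 0.0          # saying "I don't know" never pays
    return p_correct * 1.0  # +1 with probability p_correct, else 0

for p in (0.9, 0.5, 0.1):
    guess = expected_binary_reward(p, abstain=False)
    skip = expected_binary_reward(p, abstain=True)
    print(f"p(correct)={p:.1f}: guess={guess:.2f}, abstain={skip:.2f}")
# Even at p(correct)=0.1, guessing beats abstaining, so the trained
# policy answers everything and never signals uncertainty.
```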

That overconfidence has consequences. When models are deployed in medicine, law, finance, or any setting where users make decisions based on AI outputs, a system that expresses high confidence regardless of its actual certainty becomes unreliable in ways that are difficult to detect from the outside. A model that says "I'm 95 percent sure" when it is right only half the time is more dangerous than one that simply gets the answer wrong, because users have no signal telling them when to seek a second opinion.
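The mismatch described here is what calibration-error metrics quantify. A common summary is expected calibration error (ECE): bin predictions by stated confidence and average the gap between confidence and empirical accuracy. A minimal sketch, with made-up data:

```python
# Minimal expected-calibration-error (ECE) sketch: bin predictions by
# stated confidence, then average |confidence - accuracy| over the bins,
# weighted by bin size. (Data below is invented for illustration.)

def ece(confidences, corrects, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, corrects):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        err += (len(b) / total) * abs(avg_conf - accuracy)
    return err

# A model that always says "95% sure" but is right half the time:
print(ece([0.95] * 100, [1] * 50 + [0] * 50))  # ≈ 0.45, badly calibrated
# A model whose stated confidence tracks its accuracy:
print(ece([0.95] * 50 + [0.55] * 50,
          [1] * 47 + [0] * 3 + [1] * 27 + [0] * 23))  # ≈ 0.01
```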
"The standard training approach is simple and powerful, but it gives the model no incentive to express uncertainty or say I don't know," says Mehul Damani, an MIT PhD student and co-lead author on the paper. "So the model naturally learns to guess when it is unsure."
How RLCR Works
RLCR addresses this by adding a single term to the reward function: a Brier score, a well-established measure that penalizes the gap between a model's stated confidence and its actual accuracy. During training, models learn to reason about both the problem and their own uncertainty, producing an answer and a confidence estimate together.
Confidently wrong answers are penalized. So are unnecessarily uncertain correct ones. The math backs this up: the team proved formally that optimizing this reward yields models that are both accurate and well calibrated.
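A reward in this style can be sketched as a correctness term minus a Brier penalty: R = 1[correct] − (q − 1[correct])², where q is the stated confidence (an illustrative form, not necessarily the paper's exact implementation). Because the Brier score is a proper scoring rule, expected reward is maximized exactly when q equals the model's true probability of being correct:

```python
# Sketch of a Brier-style calibration reward: correctness minus the
# squared gap between stated confidence q and the 0/1 correctness
# outcome. (Illustrative form, not the researchers' exact code.)

def expected_reward(q: float, p_correct: float) -> float:
    """E[ 1[correct] - (q - 1[correct])^2 ] for stated confidence q."""
    # Correct with prob p: reward 1 - (q - 1)^2; wrong: reward -(q^2).
    return p_correct * (1 - (q - 1) ** 2) + (1 - p_correct) * (-(q ** 2))

# Grid-search the best confidence to report when true accuracy is 0.7.
p = 0.7
best_q = max((q / 100 for q in range(101)),
             key=lambda q: expected_reward(q, p))
print(best_q)  # 0.7 -- honest confidence is the reward-maximizing policy
```

This is why confidently wrong and timidly right answers are both penalized: any q above or below the true accuracy lowers the expected reward.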
"What's striking is that ordinary RL training doesn't just fail to help calibration. It actively hurts it," says Isha Puri, an MIT PhD student and co-lead author. "The models become more capable and more overconfident at the same time."
The researchers tested the approach on a 7-billion-parameter model across a range of question-answering and math benchmarks, including six datasets the model had never been trained on. The results showed a consistent pattern. Standard RL training actively degraded calibration compared to the base model, making models worse at estimating their own uncertainty. RLCR reversed that effect, substantially improving calibration with no loss in accuracy.

The method also outperformed post-hoc approaches, in which a separate classifier is trained to assign confidence scores after the fact. This is significant because many current attempts to address calibration problems focus on post-processing rather than changing the fundamental training approach.
Practical Applications and Implications
The team also demonstrated that the confidence estimates produced by RLCR are practically useful at inference time. When models generate multiple candidate answers, selecting the one with the highest self-reported confidence, or weighting votes by confidence in a majority-voting scheme, improves both accuracy and calibration as compute scales.
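Both inference-time strategies described above can be sketched in a few lines (the candidate answers and confidences here are made up for illustration):

```python
from collections import defaultdict

# Two ways to use self-reported confidence over N sampled candidates:
# pick the single most confident answer, or weight a majority vote by
# confidence. (Candidate data is invented for illustration.)

candidates = [  # (answer, self-reported confidence)
    ("42", 0.9),
    ("41", 0.6),
    ("42", 0.8),
    ("17", 0.3),
]

# 1) Best-of-N: the answer with the highest self-reported confidence.
best = max(candidates, key=lambda c: c[1])[0]

# 2) Confidence-weighted majority vote: sum confidence per answer.
votes = defaultdict(float)
for answer, conf in candidates:
    votes[answer] += conf
weighted = max(votes, key=votes.get)

print(best, weighted)  # 42 42
```

Both schemes only help if the confidences are meaningful, which is why calibration during training matters: with an overconfident model, the weights carry no information.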
This has direct applications in fields where AI decisions have real consequences:
- Medical diagnosis: AI systems that can express uncertainty could flag cases where human review is needed
- Financial analysis: Confidence estimates could help prioritize which AI-generated insights require further validation
- Legal research: Systems that acknowledge uncertainty could prevent over-reliance on potentially flawed legal precedents
- Autonomous systems: Self-driving vehicles or robotic systems that communicate their confidence levels could make safer decisions
An additional finding suggests that the act of reasoning about uncertainty itself has value. The researchers trained classifiers on model outputs and found that including the model's explicit uncertainty reasoning in the input improved the classifier's performance, particularly for smaller models. The model's self-reflective reasoning about what it does and doesn't know contains real information, not just decoration.
Limitations and Future Directions
While RLCR represents a significant advance, it's not without limitations. The method currently requires additional computation during training to calculate the confidence estimates. The researchers also note that while the approach works across diverse benchmarks, further testing is needed in domain-specific applications where calibration might have different requirements.
Additionally, the question remains how to handle situations where the model genuinely doesn't know the answer versus when it knows the answer but isn't confident. The current approach treats these scenarios similarly, but they may require different handling in practice.

Looking ahead, the researchers suggest several promising directions:
- Extending RLCR to multimodal models that process both text and images
- Developing methods for interactive calibration where users can provide feedback on confidence estimates
- Exploring how confidence estimates could be used to guide more efficient computation
- Investigating whether similar approaches could work for other types of AI systems beyond language models
In addition to Damani and Puri, other authors on the paper are Stewart Slocum, Idan Shenfeld, Leshem Choshen, and senior authors Jacob Andreas and Yoon Kim. The work builds on previous research from the same group on improving AI reasoning capabilities.
As AI systems become increasingly integrated into critical decision-making processes, the ability to accurately express uncertainty isn't just a technical improvement—it's a necessary step toward building more reliable and trustworthy AI. The RLCR method provides a practical path toward that goal, offering a way to maintain model performance while adding a crucial layer of self-awareness.
For more details on the technical approach, you can refer to the paper "Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty" which will be presented at the International Conference on Learning Representations. The researchers have also made their code and models available through their lab's GitHub repository.
