MIT researchers reveal how aggregated performance metrics mask critical failures in ML models when deployed in new environments, demonstrating that top-performing models can become the worst performers for significant patient subgroups.

Machine learning models often appear highly accurate during development, only to fail unexpectedly in real-world deployment. New MIT research exposes how reliance on aggregated performance metrics hides dangerous flaws that emerge when models encounter new data distributions. The findings, presented at NeurIPS 2025, demonstrate that models showing excellent average performance in their training environment can catastrophically fail on 6-75% of cases in new settings—a phenomenon obscured by standard evaluation methods.
The Illusion of Average Performance
Medical diagnostics provides a stark example. A chest X-ray model trained at Hospital A might show 95% accuracy overall when deployed at Hospital B. Yet MIT's analysis reveals this aggregate figure masks critical failures: The same model could misdiagnose up to 75% of patients with specific conditions like pleural effusions or heart enlargement. "We demonstrate that even when you train models on large amounts of data and choose the best average model, in a new setting this 'best model' could be the worst model for substantial portions of new data," says Marzyeh Ghassemi, MIT associate professor and senior author of the study.
The root cause lies in spurious correlations: patterns learned from the training data that appear predictive but collapse in new environments (a toy sketch after the list below illustrates the effect). For instance:
- X-ray models might associate hospital-specific artifacts (like scanner markings) with diseases
- Diagnostic tools could link patient demographics (age, race) with medical conditions
- Hate-speech detectors may falsely associate neutral phrases with toxicity based on platform-specific contexts
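The effect is easy to reproduce with synthetic data. The sketch below is an illustrative toy, not code from the paper: it builds a "source hospital" dataset in which a hypothetical scanner-artifact feature happens to co-occur with disease, and a "target hospital" dataset in which that correlation is gone, then shows a plain logistic regression (via numpy and scikit-learn) leaning on the artifact and losing accuracy on the target.

```python
# Toy illustration (not the paper's code): a spurious "scanner artifact" feature
# is highly correlated with disease at the source hospital, but not at the target.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_hospital(n, artifact_agreement):
    """Return (X, y) where X = [true_anatomy_signal, scanner_artifact]."""
    y = rng.integers(0, 2, size=n)                     # disease label
    anatomy = y + rng.normal(0, 1.5, size=n)           # weak but real signal
    # The artifact matches the label with probability `artifact_agreement`.
    agree = rng.random(n) < artifact_agreement
    artifact = np.where(agree, y, 1 - y) + rng.normal(0, 0.1, size=n)
    return np.column_stack([anatomy, artifact]), y

# Source hospital: the artifact almost perfectly tracks disease (a shortcut).
X_src, y_src = make_hospital(5000, artifact_agreement=0.95)
# Target hospital: the artifact carries no information about disease.
X_tgt, y_tgt = make_hospital(5000, artifact_agreement=0.50)

model = LogisticRegression().fit(X_src, y_src)
print("source accuracy:", model.score(X_src, y_src))  # looks excellent
print("target accuracy:", model.score(X_tgt, y_tgt))  # drops sharply
print("learned weights [anatomy, artifact]:", model.coef_)
```

Nothing about the model is broken in the usual sense; it simply exploited the most predictive feature available, which happened not to be a real anatomical signal.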

"We want models to learn anatomical features, but anything correlated with decisions in the training data can be exploited," explains lead author Olawale Salaudeen, MIT postdoc. "Those correlations often disintegrate in new environments, turning reliable models into hazardous ones."
Unmasking Hidden Failures
Traditional evaluation assumes that models ranked from best to worst on source data keep that order in new settings, a principle known as accuracy-on-the-line. The MIT team shattered this assumption by developing OODSelect, an algorithm that identifies failure-prone subgroups within new datasets. The method works in three steps (a simplified sketch follows the list):
- Training thousands of models on source data
- Measuring their accuracy on target (new) data
- Pinpointing subgroups where top-performing source models become worst performers
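A rough sketch of the third step, under the assumption that "failure-prone subgroup" means a subset of target examples on which the models' source accuracy is negatively correlated with being correct (the paper's exact selection objective may differ), could look like this:

```python
# Simplified sketch of the idea behind the subgroup-finding step above (not the
# authors' implementation): given many models trained on source data, look for
# target examples where better source accuracy goes with being wrong.
import numpy as np

def find_failure_subgroup(source_acc, target_correct, subgroup_frac=0.1):
    """
    source_acc:     (n_models,) array of in-distribution accuracies
    target_correct: (n_models, n_target) 0/1 matrix; entry [m, i] is 1 if
                    model m classifies target example i correctly
    Returns indices of target examples where better source models tend to fail.
    """
    n_models, n_target = target_correct.shape
    corr = np.zeros(n_target)
    for i in range(n_target):
        col = target_correct[:, i]
        # Skip examples every model gets right (or wrong); correlation is undefined there.
        if col.std() > 0:
            corr[i] = np.corrcoef(source_acc, col)[0, 1]
    # The most negative correlations mark examples on which the "best" source
    # models are wrong most often.
    k = max(1, int(subgroup_frac * n_target))
    return np.argsort(corr)[:k]

# Hypothetical usage, given a correctness matrix from evaluating each trained model:
# subgroup_idx = find_failure_subgroup(source_accuracies, correctness_matrix)
```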

When applied across medical imaging, pathology slides, and hate-speech detection, OODSelect consistently revealed "blind spots" where aggregated metrics hid severe performance drops. In chest X-rays alone, it identified patient subgroups where supposedly accurate models failed at rates 300% higher than random guessing.
Beyond Medical Applications
The implications extend far beyond healthcare:
- Content moderation: Models trained on one platform's hate-speech patterns fail catastrophically when applied to another platform, whose linguistic nuances differ
- Autonomous systems: Perception models relying on location-specific visual cues (e.g., street signs) become unreliable when deployed elsewhere
- Financial AI: Fraud detection systems trained on regional transaction patterns collapse when faced with cross-border behavioral differences

"Aggregate statistics are seductive but dangerous," Salaudeen notes. "They create an illusion of robustness while hiding systemic failures affecting vulnerable subgroups."
Pathway to Trustworthy AI
The researchers advocate for three fundamental shifts in ML development:
- Subgroup-Centric Evaluation: Replace monolithic metrics with granular subgroup analysis using tools like OODSelect (a minimal reporting sketch follows this list)
- Failure-Driven Design: Actively seek and address worst-case performance scenarios during development
- Continuous Validation: Implement ongoing monitoring for distribution shifts post-deployment
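In practice, the first shift can be as simple as reporting per-subgroup and worst-group accuracy next to the overall number. The minimal sketch below is illustrative only; the metadata columns such as "sex" and "condition" are hypothetical fields, not part of the study's data.

```python
# Minimal sketch of subgroup-centric evaluation: report worst-group accuracy
# alongside the overall average instead of the average alone.
import pandas as pd

def subgroup_report(df, group_cols, label_col="label", pred_col="pred"):
    """df holds one row per example with predictions and subgroup metadata."""
    df = df.assign(correct=(df[label_col] == df[pred_col]).astype(float))
    overall = df["correct"].mean()
    per_group = df.groupby(group_cols)["correct"].agg(["mean", "size"])
    worst = per_group["mean"].min()
    print(f"overall accuracy:     {overall:.3f}")
    print(f"worst-group accuracy: {worst:.3f}")
    return per_group.sort_values("mean")

# Hypothetical usage, assuming metadata columns like "sex" and "condition":
# report = subgroup_report(predictions_df, group_cols=["sex", "condition"])
# print(report)
```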
All code and identified failure subgroups from the study are publicly available through the Laboratory for Information and Decision Systems, providing benchmarks for future research. "We hope this becomes a steppingstone toward models that confront spurious correlations head-on," Ghassemi states.

As machine learning penetrates critical domains, this research sounds an urgent alarm: Trusting averages is no longer sufficient. Only by dissecting performance across subgroups can we build AI systems truly robust to the complexities of the real world. The team's NeurIPS 2025 paper, "Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations", provides both the diagnostic tools and theoretical framework needed to begin this essential transition.
