LLMs fail in 8 out of 10 early differential diagnosis cases • The Register

Regulation Reporter

AI models struggle with early medical diagnosis, getting it wrong 80% of the time despite high accuracy on final diagnoses.

AI models are increasingly being marketed as diagnostic tools, but new research shows they fail at early differential diagnosis in more than 80% of cases, raising serious concerns about their use in healthcare settings.

AI models excel at final diagnosis but struggle with uncertainty

In a study published in JAMA Network Open, a team led by Harvard medical student Arya Rao tested 21 leading off-the-shelf AI models against 29 standardized clinical vignettes. The results revealed a stark contrast between different stages of medical diagnosis.

When provided with complete medical information and asked to make a final diagnosis, the leading models achieved an impressive 91% accuracy rate. However, the picture changes dramatically when examining early differential diagnosis - the critical stage where clinicians weigh various possibilities and rule out certain conditions.

"Every model we tested failed on the vast majority of cases," Rao told The Register. "That's the stage where uncertainty matters most, and it's where these systems are weakest."

The danger of false confidence in medical AI

The research highlights a fundamental problem: AI models can project confidence even when their reasoning is flawed. This is particularly concerning in differential diagnosis, where navigating uncertainty is essential.

Dr. Marc Succi, a radiologist at Massachusetts General Hospital and paper coauthor, emphasized that this false confidence can exacerbate patient anxiety. "They can project confidence without showing robust reasoning, especially around differential diagnosis," Succi explained. "Such confidence can further inflame the worries of patients with stress and anxiety issues."

Partial correctness vs. complete failure

While the headline statistic of an 80% failure rate is alarming, Rao noted that the reality is more nuanced. When accuracy was instead measured as the proportion of correct answers within each case, the models performed significantly better, scoring between 63% and 78%.

"The raw data suggests that models were often partially correct, getting some but not all of the right answers, even when they failed to produce a fully correct differential under the stricter failure-rate definition," Rao explained.
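The gap between the two figures comes down to how a case is scored. A minimal sketch of the distinction, in Python, with invented case data and scoring functions that are illustrative assumptions rather than the study's actual methodology: under a strict failure-rate definition, a case counts as failed unless the model produces every expected diagnosis, while raw accuracy credits the fraction of expected diagnoses recovered per case.

```python
def strict_failure_rate(cases):
    """A case fails unless the model produced every expected diagnosis."""
    failures = sum(1 for expected, predicted in cases
                   if not set(expected) <= set(predicted))
    return failures / len(cases)

def raw_accuracy(cases):
    """Mean per-case fraction of expected diagnoses the model produced."""
    per_case = [len(set(expected) & set(predicted)) / len(expected)
                for expected, predicted in cases]
    return sum(per_case) / len(cases)

# (expected differential, model's differential) -- hypothetical examples,
# not drawn from the study's vignettes
cases = [
    (["pulmonary embolism", "pneumonia", "pericarditis"],
     ["pneumonia", "pericarditis"]),                # partially correct
    (["appendicitis", "ovarian torsion"],
     ["appendicitis", "ovarian torsion"]),          # fully correct
    (["meningitis", "subarachnoid hemorrhage"],
     ["migraine"]),                                 # wrong
]

print(f"strict failure rate: {strict_failure_rate(cases):.2f}")  # 0.67
print(f"raw accuracy:        {raw_accuracy(cases):.2f}")         # 0.56
```

The same three cases yield a 67% failure rate under the strict definition but 56% raw accuracy, showing how partial credit softens the headline number without the models being right more often.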

However, the research team maintains that even this more generous interpretation shouldn't be reassuring. The stricter failure-rate definition deserves attention precisely because AI bots are being marketed as frontline medical care agents.

Real-world implications for patient care

The consequences of AI failures in early diagnosis extend far beyond simple misdiagnosis. Succi warned that incorrect differentials can lead to:

  • Delays in appropriate care
  • Unnecessary procedures with potential complications
  • Increased healthcare costs
  • Patient anxiety and stress

"Even if you get to the final answer eventually, the wrong differential can result in delays in care, unnecessary procedures with complications, high costs, and much more," Succi said.

The marketing problem

The research team is particularly concerned about how AI models are being promoted in the healthcare market. "Marketing LLMs as diagnostic agents risks fostering false confidence precisely where they are least reliable," the team explained.

"Persistent failures in generating differential diagnoses and navigating uncertainty show that LLMs cannot yet be trusted in frontline decision-making."

What this means for patients

The findings serve as a stark warning for anyone considering using AI for medical self-diagnosis. The next time you're tempted to ask ChatGPT about that suspicious growth or persistent symptom, remember that today's AI systems are most likely to fail exactly when you need them most - in the early, uncertain stages of diagnosis.

As Succi and Rao both emphasize, the safest approach remains consulting with a human healthcare professional who can properly navigate the complexities and uncertainties of medical diagnosis.


The research underscores a critical gap between AI's current capabilities and the complex reasoning required for medical diagnosis, particularly in the early stages where uncertainty is highest and the stakes are greatest.
