Oxford researchers find AI chatbots perform no better than search engines for medical advice, potentially putting patients at risk through mixed recommendations and user errors.
A study by Oxford researchers has found that using AI chatbots for medical advice does not improve decision-making compared with conventional methods such as internet search, potentially putting patients at risk through inaccurate recommendations and user errors.
Study methodology and findings
Researchers from the Oxford Internet Institute and Nuffield Department of Primary Care Health Sciences partnered with MLCommons to evaluate how well AI chatbots assist the general public with medical decision-making. The study involved 1,298 UK participants who were presented with ten expert-designed medical scenarios requiring them to identify potential health conditions and recommend appropriate courses of action.
The participants were divided into two groups: a treatment group that used AI chatbots (GPT-4o, Llama 3, and Command R+) and a control group that relied on their usual diagnostic methods, typically internet search or personal knowledge.
Despite the AI models showing high proficiency when tested on their own, the combination of LLMs and human users performed no better than the control group at judging clinical acuity (how urgently care was needed) and actually performed worse at identifying relevant conditions.
Critical failures in real-world scenarios
The study revealed several concerning patterns in how AI chatbots interact with users seeking medical advice:
- Mixed messaging: LLMs often provided contradictory recommendations within the same interaction, combining good and bad advice
- Inconsistent responses: Similar symptom descriptions received opposite recommendations; one user was told to lie down in a dark room while another with nearly identical symptoms was correctly advised to seek emergency care
- Incorrect information: Chatbots recommended calling incomplete US phone numbers while simultaneously suggesting "Triple Zero", the Australian emergency number, to the UK-based participants
The gap between benchmarks and reality
One of the most significant findings was that standard benchmark testing methods fail to capture real-world human-AI interactions. While AI models excel at responding to structured questions based on medical licensing exams, they fall short in interactive scenarios that require nuanced understanding and contextual awareness.
"Training AI models on medical textbooks and clinical notes can improve their performance on medical exams, but this is very different from practicing medicine," said Luc Rocher, associate professor at the Oxford Internet Institute.
User error compounds the problem
The study found that users struggled to provide chatbots with relevant information, which compounded the accuracy issues. This highlights a fundamental challenge: even with advanced AI capabilities, the human element introduces significant variability and potential for error.
Implications for healthcare AI deployment
The findings pose a significant challenge to commercial AI service providers like Anthropic, Google, and OpenAI, all of which have expressed interest in selling AI solutions to the healthcare market. The research suggests that current AI chatbots are not ready for real-world medical decision-making, despite their strong performance on medical benchmarks.
"As more people rely on chatbots for medical advice, we risk flooding already strained hospitals with incorrect but plausible diagnoses," Rocher warned.
The authors conclude that safe deployment of LLMs as public medical assistants will require capabilities beyond expert-level medical knowledge, emphasizing the need for better integration of rule-based protocols and clinical reasoning that doctors develop through years of practice.
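One way to read the call for rule-based protocols is that deterministic safety checks, of the kind used in telephone triage, could sit alongside a chatbot's free-form advice and force escalation when red-flag symptoms appear. The sketch below is a hypothetical illustration of that layering, not a clinically validated rule set or anything proposed by the authors; the red-flag list and the `triage_wrap` function are placeholders.

```python
# Hypothetical sketch of layering a deterministic triage rule over free-form
# chatbot output. The red-flag terms are illustrative placeholders, not a
# clinically validated protocol.
RED_FLAGS = {"chest pain", "difficulty breathing", "slurred speech", "severe bleeding"}

ESCALATION_MESSAGE = (
    "Based on what you've described, please contact emergency services now "
    "rather than relying on this advice."
)


def triage_wrap(user_message: str, chatbot_reply: str) -> str:
    """Return the chatbot's reply, but prepend a fixed escalation instruction
    whenever the user's own description contains a red-flag phrase. The rule
    fires on the user's input, so a contradictory or hedged model reply cannot
    suppress it."""
    text = user_message.lower()
    if any(flag in text for flag in RED_FLAGS):
        return f"{ESCALATION_MESSAGE}\n\n{chatbot_reply}"
    return chatbot_reply


# Example: even if the model's reply is the "lie down in a dark room" advice
# described in the study, the rule layer still prepends the escalation message.
print(triage_wrap("I have crushing chest pain and feel faint", "Try lying down in a dark room."))
```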
The study raises important questions about the rush to deploy AI in healthcare settings and suggests that more research and development is needed before these tools can be safely integrated into medical decision-making processes.

