
Bot-ched advice – 'disturbing' results in AI study

Health & Society
By Peter Blackburn
10.07.25

Chatbots do quite well in diagnosing conditions ... but only in other chatbots, new research has found. Peter Blackburn reports

AI chatbots cannot replace a doctor and can produce ‘disturbing’ results when allowed to interact with human patients.

Those are the warnings from doctors and experts involved in a study in which commonly used LLM (large language model) chatbots such as GPT-4o or Llama 3 were asked to diagnose patients and determine the acuity of their condition.

The study looked at whether chatbots could accurately diagnose patients with common conditions, and whether the chatbots directed them to appropriate sources of care. It also compared the performance of chatbots when interacting with humans to their performance when interacting with another chatbot or with case vignettes presented directly to the chatbot. The chatbots performed dramatically worse when interacting with real humans than with case vignettes or with other chatbots.

Without a human involved, the chatbots completed the scenarios accurately, identifying relevant conditions in 94.9 per cent of cases and the correct disposition in 56.3 per cent on average. Involve a human, however, and those numbers dropped markedly: the correct diagnosis was identified in just 34.5 per cent of cases and the correct disposition in less than 44.2 per cent. Humans using a chatbot were also less likely to get the right diagnosis than a control group who could use any sources they chose.

PAYNE: Surprised by how badly LLM chatbots did when interacting with patients

Experts say the study has ‘startling implications’: despite all the hype around AI in healthcare, patients using these cutting-edge technologies were less likely to get the right diagnosis than patients googling their symptoms, asking friends, or seeking advice elsewhere.

Rebecca Payne, a GP, clinical senior lecturer at Bangor University and a Reuben-Clarendon scholar at Oxford University, was the clinical lead for the study, entitled ‘Clinical knowledge in LLMs does not translate to human interactions’.

Dr Payne says: ‘I was surprised at how badly the LLM chatbots did when interacting with patients. The implications are so significant. Ultimately, AI can have a role in healthcare, but not as a physician.’

When researchers analysed the full transcripts of interactions, they found that LLMs often suggested the right diagnosis at some point in the conversation but that by the end of the chat the human no longer considered it a real possibility.

The study suggests that the current practice of testing one LLM by using another does not replicate the realities of a real human interacting with the software – and experts say the findings are a stark warning that any products used by humans must be tested by humans, not just ‘the tech-literate pals of software developers’. It also suggests that, while developers claim LLMs can pass medical exams, this does not mean they are able to treat patients.

Dr Payne adds: ‘Knowledge alone isn’t enough to get through medical school; every trainee doctor has to engage with real patients. Passing a driving theory test doesn’t mean that you are safe to drive on the road.’

Embedding AI technologies within healthcare systems creates huge amounts of stress for staff and results are often suboptimal

Rebecca Payne

Further to that, there are significant implications for the potential rollout of LLMs within medicine: physicians will need to be aware that patients may be getting inaccurate information from these sources and may be presenting to an inappropriate care setting.

Dr Payne says: ‘Herein lies the mismatch between the aspirations of policymakers and the reality of the technologies we have. AI isn’t going to substitute for a lack of healthcare workers. Even when technologies are more functional, embedding them within healthcare systems creates huge amounts of stress for staff and results are often suboptimal. No amount of digitalisation can compensate for a lack of resources within healthcare systems.’

BEAN: 'Interacting with humans poses a challenge even for top LLMs'

Oxford doctoral researcher Andrew Bean, the lead author of the study, tells The Doctor there were three key findings: patients using the LLMs did not make better decisions than those using traditional methods; there was a two-way communication breakdown, with a mixture of good and poor recommendations making it difficult to identify the best course of action; and existing evaluation methods for LLMs fall short, so these systems need to be tested in the real world before being deployed.

He says: ‘Designing robust testing for large language models is key to understanding how we can make use of this new technology. In this study, we show that interacting with humans poses a challenge even for top LLMs. We hope this work will contribute to the development of safer and more useful AI systems.’

Bold claims

Just this week, Microsoft revealed a new AI system that it claims performs better than human doctors when asked to diagnose complex health issues – describing the development as creating ‘a path to medical super intelligence’. The system, which is paired with OpenAI’s ‘advanced o3 AI model’, aims to imitate a panel of expert physicians that, the company claims, can tackle ‘diagnostically complex and intellectually demanding cases’.

The company says the system ‘solved’ more than eight in 10 case studies it was given whereas practising physicians solved two out of 10. Bosses claim it could also be cheaper than using human doctors owing to being ‘more efficient’ at ordering tests.

Last year, the BMA produced a report addressing the potential benefits and drawbacks of AI in the delivery and monitoring of healthcare.

The report suggested that AI holds promise for transforming healthcare by improving precision, efficiency and preventive measures, and that it can enhance diagnostic accuracy, personalise treatments and streamline administrative tasks, potentially reducing healthcare demand and improving outcomes. However, it also suggested that the success of AI depends on its implementation, including proper testing, integration into workflows, and addressing issues of liability, regulation and data governance.

It said: ‘Risks include potential harms to patient health, exacerbation of health inequalities and impacts on doctor-patient relationships and productivity. Effective AI use requires careful management to maximise benefits and mitigate risks, ensuring it complements and enhances existing healthcare systems.’

The study is now available to read as a preprint.

(Main image credit: Getty)