
AI is completely defeated by human doctors! Research finds that large models make sloppy and unsafe clinical decisions, with diagnostic accuracy as low as 13%

2024-07-29



Will human doctors be laid off because of large models such as ChatGPT?

This worry is not unfounded. After all, Google's large model (Med-PaLM 2) has easily passed the US Medical Licensing Examination and reached the level of medical experts.

However, a recent study shows that, in clinical settings, human doctors still far outperform current artificial intelligence (AI) models, so there is no need to worry too much about "unemployment" just yet.

The related research paper, titled "Evaluation and mitigation of the limitations of large language models in clinical decision-making", was recently published in the scientific journal Nature Medicine.


The study found that even the most advanced large language models (LLMs) cannot make accurate diagnoses for all patients and perform significantly worse than human doctors.

The doctors' diagnoses were correct 89% of the time, while the LLMs' were correct only 73% of the time. In one extreme case (cholecystitis), an LLM was correct only 13% of the time.

Even more surprising, as an LLM is given more information about a case, its diagnostic accuracy can decrease, and it sometimes even orders tests that could pose serious health risks to the patient.

How does an LLM perform as an emergency physician?

Although LLMs can easily pass the US medical licensing examination, such exams and clinical case challenges test only candidates' general medical knowledge and are far less demanding than the complex clinical decision-making of daily practice.

Clinical decision making is a multistep process that requires the collection and integration of data from different sources and the ongoing evaluation of facts to reach evidence-based patient diagnostic and treatment decisions.

To further study the potential of LLMs in clinical diagnosis, a research team from the Technical University of Munich and its collaborators built a dataset of 2,400 real patient cases covering four common abdominal diseases (appendicitis, pancreatitis, cholecystitis, and diverticulitis) from the Medical Information Mart for Intensive Care database (MIMIC-IV). They used it to simulate a realistic clinical environment that reproduces the pathway from the emergency room to treatment, thereby evaluating the models' suitability as clinical decision makers.
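In such a framework, the model decides step by step which information to request before committing to a diagnosis. As a rough illustration only — the names `llm`, `run_case`, and the action set below are hypothetical stand-ins, not the study's actual MIMIC-CDM code — such a loop might look like this:

```python
# Minimal sketch of a simulated emergency-to-treatment loop.
# All names here are illustrative assumptions, not the paper's framework.

AVAILABLE_ACTIONS = {"physical_exam", "lab_tests", "imaging", "diagnosis"}

def run_case(llm, case, max_steps=10):
    """Let the model gather information step by step, then commit to a diagnosis.

    llm:  a callable that maps a prompt string to a response string.
    case: a dict with keys 'history', 'physical_exam', 'lab_tests', 'imaging'.
    """
    context = f"Patient history of present illness:\n{case['history']}"
    for _ in range(max_steps):
        # Ask the model which action to take next, given everything seen so far.
        action = llm(f"{context}\n\nChoose the next action from {sorted(AVAILABLE_ACTIONS)}:")
        if action == "diagnosis":
            return llm(f"{context}\n\nState the final diagnosis:")
        if action in AVAILABLE_ACTIONS:
            # Reveal only the requested piece of the record, as in real triage.
            context += f"\n\n{action} results:\n{case[action]}"
    return "no diagnosis reached"
```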


Figure | Dataset source and evaluation framework. The dataset is derived from real cases in the MIMIC-IV database and contains comprehensive electronic health record data recorded during hospitalization. The evaluation framework reflects a realistic clinical environment and assesses LLMs against multiple criteria, including diagnostic accuracy, adherence to diagnostic and treatment guidelines, consistency in following instructions, ability to interpret laboratory results, and robustness to changes in instructions, information volume, and information order. ICD, International Classification of Diseases; CT, computed tomography; US, ultrasound; MRCP, magnetic resonance cholangiopancreatography.

The research team tested Llama 2 and its derivatives, including general-purpose versions (such as Llama 2 Chat, Open Assistant, WizardLM) and models aligned to the medical domain (such as Clinical Camel and Meditron).

Due to privacy requirements in the MIMIC data-use agreement, the data cannot be sent to external APIs such as those of OpenAI or Google, so ChatGPT, GPT-4, and Med-PaLM could not be tested. Notably, Llama 2, Clinical Camel, and Meditron have matched or exceeded ChatGPT's performance on medical licensing exams and biomedical question-answering benchmarks.

Compared against a control group of human doctors, the results showed that the LLMs performed far worse in clinical diagnosis.

1. The diagnostic performance of LLMs is significantly lower than that of clinicians

The results showed that the current LLMs were significantly inferior to doctors in overall performance across all diseases (P < 0.001), with diagnostic accuracy gaps of 16% to 25%. Although the models performed well in diagnosing straightforward appendicitis, they fared poorly on other pathologies such as cholecystitis. In particular, the Meditron model failed at diagnosing cholecystitis, frequently diagnosing patients with "gallstones" instead.

The medically aligned LLMs performed no significantly better overall than the general-purpose models, and when an LLM had to gather all the information by itself, its performance declined further.
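For intuition, diagnostic accuracy here is simply the fraction of cases in which the model's final diagnosis names the ground-truth pathology. A minimal sketch, with hypothetical record triples standing in for the study's data and a substring match standing in for its actual scoring procedure:

```python
# Hedged sketch of a per-pathology accuracy computation.
# The records and the substring-match scoring are illustrative assumptions.
from collections import defaultdict

def accuracy_by_pathology(records):
    """records: iterable of (pathology, predicted, reference) string triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pathology, predicted, reference in records:
        totals[pathology] += 1
        # Count a diagnosis correct if it names the reference pathology.
        hits[pathology] += int(reference.lower() in predicted.lower())
    return {p: hits[p] / totals[p] for p in totals}

# Made-up example echoing the reported failure mode (gallstones vs cholecystitis):
print(accuracy_by_pathology([("cholecystitis", "gallstones", "cholecystitis")]))
# {'cholecystitis': 0.0}
```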


Figure | Diagnostic accuracy under full information provision. Data are based on a subset of MIMIC-CDM-FI (n = 80), with the mean diagnostic accuracy shown above each bar and the vertical lines indicating standard deviations. The mean performance of the LLMs was significantly worse than that of the doctors (P < 0.001), especially for cholecystitis (P < 0.001) and diverticulitis (P < 0.001).


Figure | Diagnostic accuracy in autonomous clinical decision-making scenarios. Compared with the full-information scenario, overall accuracy decreased markedly. The LLMs performed best on appendicitis but poorly on the three other pathologies: cholecystitis, diverticulitis, and pancreatitis.

2. LLMs' clinical decision-making is hasty and unsafe

The research team found that the LLMs follow diagnostic guidelines poorly and are prone to missing important medical information about their patients. They were also inconsistent in ordering the necessary laboratory tests and showed significant deficiencies in interpreting laboratory results. In short, they made hasty diagnoses without fully understanding the patient's case, posing a serious risk to patient health.


Figure | Evaluation of LLM-recommended treatments. The expected treatment is determined from clinical guidelines and the treatment that patients in the dataset actually received. Of the 808 patients, Llama 2 Chat correctly diagnosed 603; among these 603 patients, it correctly recommended appendectomy in 97.5% of cases.

3. LLMs still require extensive clinical supervision from doctors

In addition, all current LLMs were poor at following basic medical instructions, making errors in every two to four cases and hallucinating non-existent instructions in every two to five cases.


Figure | Performance of LLMs with different amounts of data. The study compared each model's performance when given all diagnostic information versus only a single diagnostic test plus the history of present illness. For almost all diseases, providing all information did not yield the best performance on the MIMIC-CDM-FI dataset, suggesting that the LLMs cannot focus on the key facts and that performance degrades when too much information is provided.

The study also showed that the information ordering yielding each model's best performance differed for each pathology, further complicating any downstream optimization of the models. Without extensive physician supervision and prior evaluation, they could not operate reliably: they showed detailed flaws in following instructions, in the order in which they processed information, and in handling relevant information, and therefore require substantial clinical supervision to ensure they operate correctly.
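One way to quantify this order sensitivity is to score the same case under every permutation of its information sections and measure the spread. A hedged sketch, in which `evaluate`, the section names, and the callable `llm` are assumptions for illustration rather than the study's code:

```python
# Illustrative probe of robustness to information ordering.
# Section names, llm, and evaluate are hypothetical stand-ins.
from itertools import permutations

SECTIONS = ["history", "physical_exam", "lab_tests", "imaging"]

def order_sensitivity(llm, case, reference, evaluate):
    """Score one case under every ordering of its information sections.

    evaluate(prediction, reference) should return a numeric score.
    Returns the max-minus-min spread across all 24 orderings.
    """
    scores = {}
    for order in permutations(SECTIONS):
        prompt = "\n\n".join(f"{name}:\n{case[name]}" for name in order)
        scores[order] = evaluate(llm(prompt), reference)
    # A robust model would score similarly across orderings; the study found
    # the best ordering differed by model and by pathology.
    return max(scores.values()) - min(scores.values())
```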

Although the study found various problems with LLMs in clinical diagnosis, LLMs still hold great promise in medicine and are likely better suited to making diagnoses from a recorded medical history and test results. The research team believes this work leaves room for further development in two respects:

  • Model validation and testing: Further research should focus on more comprehensive validation and testing of LLMs to ensure their validity in real clinical settings.

  • Multidisciplinary collaboration: AI experts should work closely with clinicians to jointly develop and optimize LLMs suited to clinical practice and to resolve the problems that arise in real applications.

How is AI disrupting healthcare?

In addition to the above research, a team from the National Institutes of Health (NIH) and its collaborators found similar problems: when answering 207 image-challenge questions, GPT-4V scored highly in selecting the correct diagnosis but often made mistakes in describing the medical images and explaining the reasoning behind the diagnosis.

Although AI currently falls far short of professional human doctors, its research and application in medicine remain an important "battlefield" on which technology companies and research universities worldwide compete.

For example, Google released the medical AI large model Med-PaLM 2, which has strong diagnostic and treatment capabilities and was the first large model to reach "expert" level on the MedQA test set.


A Tsinghua University research team proposed Agent Hospital, which can simulate the entire process of treating a disease. Its core goal is to let a doctor agent learn how to treat illnesses in a simulated environment, continuously accumulating experience from both successful and failed cases to achieve self-evolution.


Harvard Medical School led the development of PathChat, a general-purpose vision-language AI assistant for human pathology that correctly identifies diseases from biopsy slides in nearly 90% of cases, outperforming the general-purpose and medical-domain AI models currently on the market, such as GPT-4V.


Figure | Instruction fine-tuning dataset and PathChat construction

Recently, OpenAI CEO Sam Altman co-founded a new company, Thrive AI Health, which aims to use AI to help people improve their daily habits and reduce mortality from chronic diseases.

They say that hyper-personalized AI can effectively improve people's habits, thereby preventing and managing chronic disease, easing the economic burden of healthcare, and raising overall health levels.

Today, AI in healthcare has gradually moved from early experiments toward practical application, but it likely still has a long way to go before it can meaningfully augment clinicians and improve clinical decision-making, let alone replace doctors outright.