2024-10-04
New Wisdom Report
When large language models were first released, they broke through on the strength of their versatility across tasks and domains and their fluent text generation. At the time, however, the technology could only be applied to relatively simple tasks.
With the emergence of prompting techniques such as chain-of-thought, and especially OpenAI's newly released o1 model, which is the first to internalize chain-of-thought reasoning through a reinforcement learning strategy, the ability of large models to reason and solve complex problems has been raised to a whole new level.
Although the o1 model has shown surprisingly strong capabilities on various general language tasks, its performance in professional fields such as medicine is still unknown.
A Chinese team from the University of California, Santa Cruz, the University of Edinburgh, and the National Institutes of Health jointly released a report that comprehensively explores o1 in different medical scenarios, examining the model's performance in understanding, reasoning, and multilinguality.
The assessment covers six tasks using data from 37 medical datasets, including two difficult question-answering tasks based on professional medical tests from the New England Journal of Medicine (NEJM) and The Lancet.
Compared with standard medical question-answering benchmarks such as MedQA, these datasets are more clinically relevant and translate more effectively to real-world clinical scenarios.
The analysis of the o1 model shows that stronger reasoning ability in LLMs improves the model's understanding of various medical instructions and also its ability to reason through complex clinical scenarios.
Notably, the o1 model's accuracy exceeded the previous GPT-4 by an average of 6.2% across 19 datasets and 6.6% across the two complex question-answering scenarios.
At the same time, the researchers found several weaknesses in both model capabilities and existing evaluation protocols, including hallucination, inconsistent multilingual ability, and inconsistent evaluation metrics.
Comprehensive assessment of the medical capabilities of large models
To improve a model's reasoning ability, chain-of-thought (CoT) prompting is a commonly used strategy that draws on the model's internal reasoning patterns to enhance its ability to solve complex tasks.
The o1 model goes one step further, embedding the CoT process into model training and integrating reinforcement learning, and it demonstrates strong reasoning performance. However, o1 has not yet been evaluated on data from professional fields, and its performance on domain-specific tasks is still unknown.
Existing LLM benchmarks in the medical field usually evaluate only specific capabilities of a model, such as knowledge and reasoning, safety, or multilinguality. These tests are relatively isolated from one another and cannot comprehensively evaluate an advanced model like o1.
To ensure a comprehensive assessment, the researchers collected a variety of medical tasks and datasets covering the above aspects and explored three prompting strategies in the process (a minimal prompt sketch follows the list):
1. Direct prompting, which guides the large language model to solve the problem directly.
2. Chain-of-thought, which requires the model to think step by step before generating the final answer.
3. Few-shot prompting, which provides the model with several examples so it can learn the input-output mapping on the fly.
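To make the three strategies concrete, here is a minimal sketch of how such prompts might be assembled; the helper name, example questions, and templates are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Minimal sketch of the three prompting strategies (illustrative only;
# helper names and example questions are assumptions, not the paper's code).

def build_prompt(question: str, strategy: str = "direct",
                 few_shot_examples: list[tuple[str, str]] | None = None) -> str:
    """Assemble a prompt for a medical QA item under one of three strategies."""
    if strategy == "direct":
        # 1. Direct prompting: ask the model to answer immediately.
        return f"Answer the following medical question.\n\nQuestion: {question}\nAnswer:"
    if strategy == "cot":
        # 2. Chain-of-thought: ask the model to reason step by step first.
        return (f"Answer the following medical question. Think step by step, "
                f"then give the final answer.\n\nQuestion: {question}\nReasoning:")
    if strategy == "few_shot":
        # 3. Few-shot: prepend solved examples so the model can infer
        #    the input-output mapping in context.
        shots = "\n\n".join(f"Question: {q}\nAnswer: {a}"
                            for q, a in (few_shot_examples or []))
        return f"{shots}\n\nQuestion: {question}\nAnswer:"
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    demo_shots = [("Which vitamin deficiency causes scurvy?", "Vitamin C")]
    for s in ("direct", "cot", "few_shot"):
        print(f"--- {s} ---")
        print(build_prompt("Which electrolyte disturbance prolongs the QT interval?",
                           s, demo_shots))
```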
Finally, an appropriate metric is used to measure the gap between the generated responses and the reference answers.
Aspects and tasks
The researchers used 35 existing datasets and created 2 additional, more difficult datasets for evaluation, then classified all 37 datasets into 3 aspects and 6 tasks for clearer evaluation and analysis of how a model performs in a specific domain.
Understanding refers to the model's ability to use its internal medical knowledge to comprehend medical concepts.
For example, in concept recognition tasks, the model needs to extract or elaborate medical concepts from articles or diagnostic reports; in text summarization, the model needs to understand the concepts in complex texts in order to generate concise summaries.
Reasoning tests the model's ability to think logically through multiple steps to reach a conclusion.
In question-answering tasks, the model needs to follow the prompt instructions, reason over the medical information provided in the question, and select the correct answer from multiple options.
In addition to common question-answering datasets, the researchers also collected real-world clinical questions from The Lancet, the New England Journal of Medicine (NEJM), and Medbullets to better evaluate the clinical utility of LLMs.
In clinical recommendation tasks, the model needs to provide treatment recommendations or diagnostic decisions based on patient information. In the AI Hospital and AgentClinic datasets, the model needs to act as a medical agent; in the MedCalc-Bench dataset, the model needs to perform mathematical reasoning and compute answers.
Multilinguality covers tasks in which the input instructions and output answers are in different languages.
The XMedBench dataset requires LLMs to answer medical questions in six languages, including Chinese, Arabic, Hindi, Spanish, and English; in the AI Hospital dataset, the model needs to answer questions in Chinese.
Evaluation metrics
Accuracy directly measures the percentage of model-generated answers that exactly match the reference answer.
It is mainly used when the reference answer is a word or phrase, including the multiple-choice datasets, the MedCalc-Bench dataset, and the clinical recommendation and concept recognition datasets.
F1 score, the harmonic mean of precision and recall, is used on datasets where the model needs to select multiple correct answers.
BLEU and ROUGE are natural language processing metrics that measure the similarity between generated responses and reference answers; BLEU-1 and ROUGE-1 are used for all free-form generation tasks in the evaluation.
AlignScore is a metric that measures the factual consistency of generated text; it is applied to all free-form generation tasks to assess the degree to which the model hallucinates.
MAUVE is a metric that measures the gap between the distribution of generated text and that of human-written text; it is also used for all free-form generation tasks. Its value ranges from 0 to 100, with higher values indicating higher-quality model output.
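As a rough illustration of how the simpler metrics are computed, below is a self-contained sketch of exact-match accuracy, a set-based F1 score, and unigram-overlap scores in the spirit of BLEU-1 and ROUGE-1. A real evaluation would rely on the reference implementations (including the official AlignScore and MAUVE packages); these simplified functions are assumptions for illustration only.

```python
from collections import Counter

def accuracy(preds: list[str], golds: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answer."""
    return sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(preds, golds)) / len(golds)

def set_f1(pred_items: set[str], gold_items: set[str]) -> float:
    """F1 over a predicted set of answers (for multi-answer questions)."""
    if not pred_items or not gold_items:
        return 0.0
    tp = len(pred_items & gold_items)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_items), tp / len(gold_items)
    return 2 * precision * recall / (precision + recall)

def unigram_overlap(candidate: str, reference: str) -> tuple[float, float]:
    """Clipped unigram precision (BLEU-1-like) and unigram recall (ROUGE-1-like)."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    return overlap / max(sum(cand.values()), 1), overlap / max(sum(ref.values()), 1)

if __name__ == "__main__":
    print(accuracy(["aspirin"], ["Aspirin"]))                # exact match -> 1.0
    print(set_f1({"fever", "cough"}, {"fever", "dyspnea"}))  # one of two correct -> 0.5
    print(unigram_overlap("the patient has fever", "patient has a high fever"))
```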
Experimental results
Prompting strategy
For knowledge question-answering tasks, agent tasks, medical calculation tasks, and multilingual tasks, a direct prompting evaluation is used;
for the other tasks from MedS-Bench, the three-shot prompting setting of the benchmark is followed.
According to OpenAI's statement, common prompting techniques such as chain-of-thought (CoT) and in-context examples do not help much in improving o1's performance, because the model already has an implicit CoT built in.
To further test this claim, the researchers also assessed the effects of several advanced prompting techniques, including CoT, self-consistency, and reflection.
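Self-consistency, for instance, samples several chain-of-thought completions and takes a majority vote over the final answers. The sketch below illustrates only the voting step, with a hypothetical `sample_answer` function standing in for a call to the model; it is not the researchers' actual pipeline.

```python
from collections import Counter
import random

def sample_answer(question: str) -> str:
    """Hypothetical stand-in for one sampled chain-of-thought completion,
    reduced to its final answer (a real setup would call the model API here)."""
    return random.choice(["A", "A", "B", "A", "C"])  # toy answer distribution for the demo

def self_consistency(question: str, n_samples: int = 5) -> str:
    """Self-consistency: sample several reasoning paths and majority-vote the answers."""
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    print(self_consistency("Which option best explains the patient's hyponatremia?"))
```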
In addition to GPT-3.5, GPT-4, and o1, the researchers selected two open-source models for evaluation: Meditron-70B, a large language model trained on medical-domain data, and Llama3-8B, the latest open-source large language model.
Main results
o1's capabilities in clinical understanding have been enhanced
When the o1 model was released, OpenAI mainly emphasized its significant improvements in knowledge and reasoning, such as mathematical problem solving and code generation. The experimental results show the same trend, and this ability also transfers to the understanding of specific clinical knowledge.
o1 outperforms the other models in understanding on most clinical tasks. For example, on 5 concept recognition datasets that use F1 as the metric, o1 outperforms GPT-4 and GPT-3.5 by an average of 7.6% and 26.6%, respectively, with an average improvement of 24.5% on the commonly used BC4Chem dataset.
On the summarization task, o1 improves the ROUGE-1 score by 2.4% and 3.7% over GPT-4 and GPT-3.5, respectively, demonstrating its enhanced ability in real-world clinical understanding. The results also confirm that advances in the general natural language processing capabilities of large language models can translate effectively into better understanding in the medical field.
The powerful reasoning ability of the o1 model in clinical diagnosis scenarios
On reasoning-related tasks, the o1 model also demonstrates its advantages in real-world diagnostic situations.
On the newly constructed and challenging question-answering tasks NEJMQA and LancetQA, o1's average accuracy on the respective datasets improves by 8.9% and 27.1% compared with GPT-4 (79.6%) and GPT-3.5 (61.5%), respectively.
Another notable improvement is in o1's mathematical reasoning: it raises the MedCalc-Bench baseline to 34.9%, a significant 9.4% higher than GPT-4.
In more complex reasoning scenarios involving multi-turn dialogue and environment simulation, o1 outperforms GPT-4 and GPT-3.5 on the AgentClinic benchmark, improving accuracy by at least 15.5% and 10% on the MedQA and NEJM subsets, with scores of 45.5% and 20.0%, respectively.
Besides higher accuracy, o1's answers are also more concise and direct, whereas GPT-4 tends to generate hallucinated explanations alongside wrong answers.
The researchers attribute o1's improvements in knowledge and reasoning mainly to the enhanced data and underlying techniques (such as CoT data and reinforcement learning) used during training.
Based on these encouraging results, the researchers state enthusiastically in the paper that, with the o1 model, we are getting ever closer to a fully automated AI doctor.