Avnat Eden, Levy Michal, Herstain Daniel, Yanko Elia, Ben Joya Daniel, Tzuchman Katz Michal, Eshel Dafna, Laros Sahar, Dagan Yael, Barami Shahar, Mermelstein Joseph, Ovadia Shahar, Shomron Noam, Shalev Varda, Abdulnour Raja-Elie E
Faculty of Medicine, Tel Aviv University, Chaim Levanon St 55, Tel Aviv, 6997801, Israel, 972 545299622.
Kahun Medical Ltd, Givatayim, Israel.
J Med Internet Res. 2025 Jul 14;27:e64452. doi: 10.2196/64452.
BACKGROUND: Clinical problem-solving requires processing of semantic medical knowledge, such as illness scripts, and numerical medical knowledge of diagnostic tests for evidence-based decision-making. While large language models (LLMs) show promising results in many aspects of language-based clinical practice, their ability to generate nonlanguage, evidence-based answers to clinical questions is inherently limited by tokenization.
OBJECTIVE: This study aimed to evaluate LLMs' performance on two question types, numerical (correlating findings) and semantic (differentiating entities), while examining differences within and between LLMs across medical aspects and comparing their performance with that of humans.
METHODS: To generate straightforward multiple-choice questions and answers (Q and As) based on evidence-based medicine (EBM), we used a comprehensive medical knowledge graph (containing data from more than 50,000 peer-reviewed studies) and created the EBM questions and answers (EBMQA) dataset. EBMQA comprises 105,222 Q and As, categorized by medical topics (eg, medical disciplines) and nonmedical topics (eg, question length), and classified into numerical or semantic types. We benchmarked a dataset of 24,000 Q and As on two state-of-the-art LLMs, GPT-4 (OpenAI) and Claude 3 Opus (Anthropic). We evaluated the LLMs' accuracy on semantic and numerical question types and by sublabeled topic. In addition, we examined the question-answering rate of the LLMs by allowing them to abstain from responding to questions. For validation, we compared results on 100 unrelated numerical EBMQA questions between six human medical experts and the two LLMs.
RESULTS: In an analysis of 24,542 Q and As, Claude 3 and GPT-4 performed better on semantic Q and As (68.7%, n=1593 and 68.4%, n=1709, respectively) than on numerical Q and As (61.3%, n=8583 and 56.7%, n=12,038, respectively), with Claude 3 outperforming GPT-4 in numerical accuracy (P<.001). A median accuracy gap of 7% (IQR 5%-10%) was observed between the best and worst sublabels per topic, with different LLMs excelling in different sublabels. Focusing on medical discipline sublabels, Claude 3 performed well in neoplastic disorders but struggled with genitourinary disorders (69%, n=676 vs 58%, n=464; P<.0001), while GPT-4 excelled in cardiovascular disorders but struggled with neoplastic disorders (60%, n=1076 vs 53%, n=704; P=.0002). Furthermore, humans (82.3%, n=82.3) surpassed both Claude 3 (64.3%, n=64.3; P<.001) and GPT-4 (55.8%, n=55.8; P<.001) in the validation test. The Spearman correlation between question-answering rate and accuracy rate was not significant for either Claude 3 or GPT-4 (ρ=0.12, P=.69 and ρ=0.43, P=.13, respectively).
CONCLUSIONS: Both LLMs performed better on semantic than on numerical Q and As, with Claude 3 surpassing GPT-4 on numerical Q and As. However, both LLMs showed inter- and intramodel gaps across different medical aspects and remained inferior to humans. In addition, their willingness to answer or abstain from a question does not reliably predict how accurately they perform when they do answer. Thus, their medical advice should be interpreted with caution.
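To make the evaluation described in the Methods and Results concrete, the sketch below shows one way the per-sublabel accuracy, question-answering rate (with an abstain option), and the Spearman correlation between the two could be computed. This is a minimal illustration, not the authors' pipeline: the ABSTAIN label, the records structure, and the toy data are assumptions introduced here for demonstration only.

    # Minimal sketch (not the authors' code): score multiple-choice answers
    # that allow an "abstain" option, then test whether a model's
    # question-answering rate tracks its accuracy across topic sublabels
    # using a Spearman correlation, as described in the abstract.
    from collections import defaultdict
    from scipy.stats import spearmanr

    ABSTAIN = "abstain"  # hypothetical label for a declined question

    def score_by_sublabel(records):
        """records: iterable of (sublabel, model_answer, correct_answer)."""
        stats = defaultdict(lambda: {"asked": 0, "answered": 0, "correct": 0})
        for sublabel, model_answer, correct_answer in records:
            s = stats[sublabel]
            s["asked"] += 1
            if model_answer == ABSTAIN:
                continue  # abstentions count toward asked but not answered
            s["answered"] += 1
            s["correct"] += int(model_answer == correct_answer)
        return stats

    def answering_vs_accuracy(stats):
        """Spearman correlation between answering rate and accuracy per sublabel."""
        answering_rates, accuracies = [], []
        for s in stats.values():
            if s["answered"] == 0:
                continue  # skip sublabels where the model abstained on everything
            answering_rates.append(s["answered"] / s["asked"])
            accuracies.append(s["correct"] / s["answered"])
        return spearmanr(answering_rates, accuracies)

    # Usage with toy data (hypothetical sublabels and answers):
    records = [
        ("cardiovascular", "B", "B"),
        ("cardiovascular", ABSTAIN, "A"),
        ("neoplastic", "C", "C"),
        ("neoplastic", "D", "C"),
        ("genitourinary", "A", "A"),
    ]
    rho, p_value = answering_vs_accuracy(score_by_sublabel(records))
    print(f"Spearman rho={rho:.2f}, P={p_value:.2f}")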