

Performance of Large Language Models in Numerical Versus Semantic Medical Knowledge: Cross-Sectional Benchmarking Study on Evidence-Based Questions and Answers.

Author Information

Avnat Eden, Levy Michal, Herstain Daniel, Yanko Elia, Ben Joya Daniel, Tzuchman Katz Michal, Eshel Dafna, Laros Sahar, Dagan Yael, Barami Shahar, Mermelstein Joseph, Ovadia Shahar, Shomron Noam, Shalev Varda, Abdulnour Raja-Elie E

Affiliations

Faculty of Medicine, Tel Aviv University, Chaim Levanon St 55, Tel Aviv, 6997801, Israel, 972 545299622.

Kahun Medical Ltd, Givatayim, Israel.

Publication Information

J Med Internet Res. 2025 Jul 14;27:e64452. doi: 10.2196/64452.


DOI: 10.2196/64452
PMID: 40658983
Abstract

BACKGROUND: Clinical problem-solving requires processing of semantic medical knowledge, such as illness scripts, and numerical medical knowledge of diagnostic tests for evidence-based decision-making. As large language models (LLMs) show promising results in many aspects of language-based clinical practice, their ability to generate nonlanguage evidence-based answers to clinical questions is inherently limited by tokenization.

OBJECTIVE: This study aimed to evaluate LLMs' performance on two question types, numerical (correlating findings) and semantic (differentiating entities), while examining differences within and between LLMs in medical aspects and comparing their performance to humans.

METHODS: To generate straightforward multiple-choice questions and answers (Q and As) based on evidence-based medicine (EBM), we used a comprehensive medical knowledge graph (containing data from more than 50,000 peer-reviewed studies) and created the EBM questions and answers (EBMQAs). EBMQA comprises 105,222 Q and As, categorized by medical topics (eg, medical disciplines) and nonmedical topics (eg, question length), and classified into numerical or semantic types. We benchmarked a dataset of 24,000 Q and As on two state-of-the-art LLMs, GPT-4 (OpenAI) and Claude 3 Opus (Anthropic), evaluating each model's accuracy by question type and by sublabeled topic. In addition, we examined the models' question-answering rate by allowing them to abstain from responding to questions. For validation, we compared the results of six human medical experts and the two LLMs on 100 unrelated numerical EBMQA questions.

RESULTS: In an analysis of 24,542 Q and As, Claude 3 and GPT-4 performed better on semantic Q and As (68.7%, n=1593 and 68.4%, n=1709, respectively) than on numerical Q and As (61.3%, n=8583 and 56.7%, n=12,038, respectively), with Claude 3 outperforming GPT-4 in numerical accuracy (P<.001). A median accuracy gap of 7% (IQR 5%-10%) was observed between the best and worst sublabels per topic, with different LLMs excelling in different sublabels. Among the Medical Discipline sublabels, Claude 3 performed well in neoplastic disorders but struggled with genitourinary disorders (69%, n=676 vs 58%, n=464; P<.0001), while GPT-4 excelled in cardiovascular disorders but struggled with neoplastic disorders (60%, n=1076 vs 53%, n=704; P=.0002). Furthermore, humans (82.3%, n=82.3) surpassed both Claude 3 (64.3%, n=64.3; P<.001) and GPT-4 (55.8%, n=55.8; P<.001) in the validation test. The Spearman correlation between question-answering rate and accuracy was insignificant for both Claude 3 and GPT-4 (ρ=0.12, P=.69 and ρ=0.43, P=.13, respectively).

CONCLUSIONS: Both LLMs performed better on semantic than on numerical Q and As, with Claude 3 surpassing GPT-4 on the numerical ones. However, both LLMs showed inter- and intramodel gaps across medical aspects and remained inferior to humans. In addition, a model's choice to answer or abstain does not reliably predict how accurately it performs when it does attempt an answer. Thus, their medical advice should be treated with caution.
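The RESULTS rest on two simple computations: per-type accuracy over answered questions (abstentions excluded) and a Spearman correlation between answering rate and accuracy. Below is a minimal Python sketch of both, not the authors' released code; the record fields `qtype`, `model_answer`, and `correct_answer` are hypothetical names assumed for illustration.

```python
# Minimal sketch of the abstract's evaluation protocol (not the authors'
# released code). Field names `qtype`, `model_answer`, and `correct_answer`
# are hypothetical; any iterable of dicts with those keys will work.
from scipy.stats import spearmanr

ABSTAIN = "abstain"  # sentinel for questions the model declined to answer

def accuracy_by_type(records):
    """Accuracy among *answered* questions, grouped by question type
    (e.g., "semantic" vs "numerical"); abstentions are excluded."""
    correct, answered = {}, {}
    for r in records:
        if r["model_answer"] == ABSTAIN:
            continue  # abstentions count toward answering rate, not accuracy
        t = r["qtype"]
        answered[t] = answered.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + int(r["model_answer"] == r["correct_answer"])
    return {t: correct.get(t, 0) / n for t, n in answered.items()}

def answering_rate_vs_accuracy(per_topic):
    """Spearman rho between answering rate and accuracy across topics.

    `per_topic` maps a topic label to (n_answered, n_total, n_correct);
    assumes each topic has at least one answered question.
    """
    rates = [a / t for a, t, _ in per_topic.values()]
    accs = [c / a for a, _, c in per_topic.values()]
    return spearmanr(rates, accs)  # (rho, P value)
```

An insignificant rho from such an analysis, as reported for both models (ρ=0.12, P=.69; ρ=0.43, P=.13), means a model's willingness to answer is not a useful proxy for how accurate its answers will be.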


Similar Articles

[1]
Performance of Large Language Models in Numerical Versus Semantic Medical Knowledge: Cross-Sectional Benchmarking Study on Evidence-Based Questions and Answers.

J Med Internet Res. 2025-7-14

[2]
Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions.

J Surg Educ. 2025-4

[3]
Sexual Harassment and Prevention Training

2025-1

[4]
Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.

J Med Internet Res. 2025-5-20

[5]
Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.

JMIR Form Res. 2024-12-17

[6]
Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study.

J Med Internet Res. 2025-2-7

[7]
Large Language Models and Empathy: Systematic Review.

J Med Internet Res. 2024-12-11

[8]
Comparison of Two Modern Survival Prediction Tools, SORG-MLA and METSSS, in Patients With Symptomatic Long-bone Metastases Who Underwent Local Treatment With Surgery Followed by Radiotherapy and With Radiotherapy Alone.

Clin Orthop Relat Res. 2024-12-1

[9]
Implementing Large Language Models in Health Care: Clinician-Focused Review With Interactive Guideline.

J Med Internet Res. 2025-7-11

[10]
Short-Term Memory Impairment

2025-1

