

Performance of Large Language Models in Numerical Versus Semantic Medical Knowledge: Cross-Sectional Benchmarking Study on Evidence-Based Questions and Answers.

Author Information

Avnat Eden, Levy Michal, Herstain Daniel, Yanko Elia, Ben Joya Daniel, Tzuchman Katz Michal, Eshel Dafna, Laros Sahar, Dagan Yael, Barami Shahar, Mermelstein Joseph, Ovadia Shahar, Shomron Noam, Shalev Varda, Abdulnour Raja-Elie E

Affiliations

Faculty of Medicine, Tel Aviv University, Chaim Levanon St 55, Tel Aviv, 6997801, Israel, 972 545299622.

Kahun Medical Ltd, Givatayim, Israel.

Publication Information

J Med Internet Res. 2025 Jul 14;27:e64452. doi: 10.2196/64452.

DOI: 10.2196/64452
PMID: 40658983
Abstract

BACKGROUND

Clinical problem-solving requires processing of semantic medical knowledge, such as illness scripts, and numerical medical knowledge of diagnostic tests for evidence-based decision-making. While large language models (LLMs) show promising results in many aspects of language-based clinical practice, their ability to generate nonlanguage evidence-based answers to clinical questions is inherently limited by tokenization.

OBJECTIVE

This study aimed to evaluate LLMs' performance on two question types: numeric (correlating findings) and semantic (differentiating entities), while examining differences within and between LLMs in medical aspects and comparing their performance to humans.

METHODS

To generate straightforward multichoice questions and answers (Q and As) based on evidence-based medicine (EBM), we used a comprehensive medical knowledge graph (containing data from more than 50,000 peer-reviewed studies) and created the EBM questions and answers (EBMQAs). EBMQA comprises 105,222 Q and As, categorized by medical topics (eg, medical disciplines) and nonmedical topics (eg, question length), and classified into numerical or semantic types. We benchmarked a dataset of 24,000 Q and As on two state-of-the-art LLMs, GPT-4 (OpenAI) and Claude 3 Opus (Anthropic). We evaluated the LLMs' accuracy on semantic and numerical question types and according to sublabeled topics. In addition, we examined the question-answering rate of LLMs by enabling them to choose to abstain from responding to questions. For validation, we compared the results for 100 unrelated numerical EBMQA questions between six human medical experts and the two language models.
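The scoring protocol described above (per-type accuracy, plus an answer rate when the model may abstain) can be sketched as follows. This is a minimal illustration, not the authors' code; the `ask_model` callback and `QA` fields are hypothetical stand-ins for whatever harness and schema the study actually used.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class QA:
    question: str
    correct: str
    qtype: str  # "numerical" or "semantic"

def score_benchmark(qas: list[QA],
                    ask_model: Callable[[str], Optional[str]]) -> dict:
    """Score a model on QA items; ask_model returns None to abstain."""
    stats = {t: {"asked": 0, "answered": 0, "correct": 0}
             for t in ("numerical", "semantic")}
    for qa in qas:
        s = stats[qa.qtype]
        s["asked"] += 1
        answer = ask_model(qa.question)
        if answer is None:  # model chose to abstain
            continue
        s["answered"] += 1
        if answer == qa.correct:
            s["correct"] += 1
    # accuracy is computed over answered questions only;
    # answer rate is computed over all questions asked
    return {t: {"accuracy": s["correct"] / s["answered"] if s["answered"] else 0.0,
                "answer_rate": s["answered"] / s["asked"] if s["asked"] else 0.0}
            for t, s in stats.items()}
```

Separating accuracy (over attempted questions) from answer rate (over all questions) matters here, because the study later asks whether the two are correlated.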

RESULTS

In an analysis of 24,542 Q and As, Claude 3 and GPT-4 performed better on semantic Q and As (68.7%, n=1593 and 68.4%, n=1709, respectively) than on numerical Q and As (61.3%, n=8583 and 56.7%, n=12,038, respectively), with Claude 3 outperforming GPT-4 in numerical accuracy (P<.001). A median accuracy gap of 7% (IQR 5%-10%) was observed between the best and worst sublabels per topic, with different LLMs excelling in different sublabels. Among Medical Discipline sublabels, Claude 3 performed well in neoplastic disorders but struggled with genitourinary disorders (69%, n=676 vs 58%, n=464; P<.0001), while GPT-4 excelled in cardiovascular disorders but struggled with neoplastic disorders (60%, n=1076 vs 53%, n=704; P=.0002). In the validation test, humans (82.3%) surpassed both Claude 3 (64.3%; P<.001) and GPT-4 (55.8%; P<.001). The Spearman correlation between question-answering rate and accuracy was not significant for either Claude 3 or GPT-4 (ρ=0.12, P=.69 and ρ=0.43, P=.13, respectively).
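The reported ρ values are Spearman rank correlations between answer rate and accuracy across topic sublabels. As a minimal illustration of the statistic itself (not a reproduction of the study's analysis), a dependency-free sketch:

```python
def _ranks(xs: list[float]) -> list[float]:
    """Assign 1-based ranks, averaging ranks within tie groups."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

In practice one would use `scipy.stats.spearmanr`, which also returns the P value; the sketch above only computes ρ.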

CONCLUSIONS

Both LLMs performed better on semantic than on numerical Q and As, with Claude 3 surpassing GPT-4 on numerical Q and As. However, both LLMs showed inter- and intramodel gaps across different medical aspects and remained inferior to humans. In addition, their willingness to answer or abstain from a question does not reliably predict how accurately they perform when they do attempt an answer. Thus, their medical advice should be interpreted with caution.


