
Toward expert-level medical question answering with large language models.

Authors

Singhal Karan, Tu Tao, Gottweis Juraj, Sayres Rory, Wulczyn Ellery, Amin Mohamed, Hou Le, Clark Kevin, Pfohl Stephen R, Cole-Lewis Heather, Neal Darlene, Rashid Qazi Mamunur, Schaekermann Mike, Wang Amy, Dash Dev, Chen Jonathan H, Shah Nigam H, Lachgar Sami, Mansfield Philip Andrew, Prakash Sushant, Green Bradley, Dominowska Ewa, Agüera Y Arcas Blaise, Tomašev Nenad, Liu Yun, Wong Renee, Semturs Christopher, Mahdavi S Sara, Barral Joelle K, Webster Dale R, Corrado Greg S, Matias Yossi, Azizi Shekoofeh, Karthikesalingam Alan, Natarajan Vivek

Affiliations

Google Research, Mountain View, CA, USA.

Google DeepMind, Mountain View, CA, USA.

Publication information

Nat Med. 2025 Mar;31(3):943-950. doi: 10.1038/s41591-024-03423-7. Epub 2025 Jan 8.

DOI: 10.1038/s41591-024-03423-7
PMID: 39779926
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11922739/

Abstract

Large language models (LLMs) have shown promise in medical question answering, with Med-PaLM being the first to exceed a 'passing' score in United States Medical Licensing Examination style questions. However, challenges remain in long-form medical question answering and handling real-world workflows. Here, we present Med-PaLM 2, which bridges these gaps with a combination of base LLM improvements, medical domain fine-tuning and new strategies for improving reasoning and grounding through ensemble refinement and chain of retrieval. Med-PaLM 2 scores up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%, and demonstrates dramatic performance increases across MedMCQA, PubMedQA and MMLU clinical topics datasets. Our detailed human evaluations framework shows that physicians prefer Med-PaLM 2 answers to those from other physicians on eight of nine clinical axes. Med-PaLM 2 also demonstrates significant improvements over its predecessor across all evaluation metrics, particularly on new adversarial datasets designed to probe LLM limitations (P < 0.001). In a pilot study using real-world medical questions, specialists preferred Med-PaLM 2 answers to generalist physician answers 65% of the time. While specialist answers were still preferred overall, both specialists and generalists rated Med-PaLM 2 to be as safe as physician answers, demonstrating its growing potential in real-world medical applications.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e71/11922739/d411ec105956/41591_2024_3423_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e71/11922739/ba3f336b9845/41591_2024_3423_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e71/11922739/543bd74bd431/41591_2024_3423_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e71/11922739/fcd0535dc0aa/41591_2024_3423_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e71/11922739/42741a854c76/41591_2024_3423_Fig5_ESM.jpg

Similar articles

1
Toward expert-level medical question answering with large language models.
Nat Med. 2025 Mar;31(3):943-950. doi: 10.1038/s41591-024-03423-7. Epub 2025 Jan 8.
2
Large language models encode clinical knowledge.
Nature. 2023 Aug;620(7972):172-180. doi: 10.1038/s41586-023-06291-2. Epub 2023 Jul 12.
3
One LLM is not Enough: Harnessing the Power of Ensemble Learning for Medical Question Answering.
medRxiv. 2023 Dec 24:2023.12.21.23300380. doi: 10.1101/2023.12.21.23300380.
4
Can large language models reason about medical questions?
Patterns (N Y). 2024 Mar 1;5(3):100943. doi: 10.1016/j.patter.2024.100943. eCollection 2024 Mar 8.
5
OpenMedLM: prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models.
Sci Rep. 2024 Jun 19;14(1):14156. doi: 10.1038/s41598-024-64827-6.
6
Performance of Large Language Models on a Neurology Board-Style Examination.
JAMA Netw Open. 2023 Dec 1;6(12):e2346721. doi: 10.1001/jamanetworkopen.2023.46721.
7
Evaluating the Performance of Large Language Models (LLMs) in Answering and Analysing the Chinese Dental Licensing Examination.
Eur J Dent Educ. 2025 May;29(2):332-340. doi: 10.1111/eje.13073. Epub 2025 Jan 31.
8
Using Large Language Models to Automate Data Extraction From Surgical Pathology Reports: Retrospective Cohort Study.
JMIR Form Res. 2025 Apr 7;9:e64544. doi: 10.2196/64544.
9
Evaluating and Enhancing Japanese Large Language Models for Genetic Counseling Support: Comparative Study of Domain Adaptation and the Development of an Expert-Evaluated Dataset.
JMIR Med Inform. 2025 Jan 16;13:e65047. doi: 10.2196/65047.
10
Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions.
J Surg Educ. 2025 Apr;82(4):103442. doi: 10.1016/j.jsurg.2025.103442. Epub 2025 Feb 9.

Cited by

1
ChatGPT's role in the rapidly evolving hematologic cancer landscape.
Future Sci OA. 2025 Dec;11(1):2546259. doi: 10.1080/20565623.2025.2546259. Epub 2025 Sep 3.
2
Assessing LLM-generated vs. expert-created clinical anatomy MCQs: a student perception-based comparative study in medical education.
Med Educ Online. 2025 Dec;30(1):2554678. doi: 10.1080/10872981.2025.2554678. Epub 2025 Aug 30.
3
Application and ethical implication of generative artificial intelligence in medical education: a cross-sectional study among critical care academic physicians in China.
BMC Med Educ. 2025 Aug 29;25(1):1225. doi: 10.1186/s12909-025-07825-0.
4
Comparative evaluation of large language models performance in medical education using urinary system histology assessment.
Sci Rep. 2025 Aug 29;15(1):31933. doi: 10.1038/s41598-025-17571-4.
5
An eyecare foundation model for clinical assistance: a randomized controlled trial.
Nat Med. 2025 Aug 28. doi: 10.1038/s41591-025-03900-7.
6
Development of a Large-Scale Dataset of Chest Computed Tomography Reports in Japanese and a High-Performance Finding Classification Model: Dataset Development and Validation Study.
JMIR Med Inform. 2025 Aug 28;13:e71137. doi: 10.2196/71137.
7
Two stage large language model approach enhancing entity classification and relationship mapping in radiology reports.
Sci Rep. 2025 Aug 27;15(1):31550. doi: 10.1038/s41598-025-16213-z.
8
GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model.
Endosc Int Open. 2025 Aug 6;13:a26372163. doi: 10.1055/a-2637-2163. eCollection 2025.
9
Incentivizing inclusive contributions in model sharing markets.
Nat Commun. 2025 Aug 25;16(1):7923. doi: 10.1038/s41467-025-62959-5.
10
Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data.
Nat Commun. 2025 Aug 23;16(1):7866. doi: 10.1038/s41467-025-62385-7.