Adapted large language models can outperform medical experts in clinical text summarization.

Affiliations

Department of Electrical Engineering, Stanford University, Stanford, CA, USA.

Stanford Center for Artificial Intelligence in Medicine and Imaging, Palo Alto, CA, USA.

Publication information

Nat Med. 2024 Apr;30(4):1134-1142. doi: 10.1038/s41591-024-02855-5. Epub 2024 Feb 27.

DOI:10.1038/s41591-024-02855-5

PMID:38413730

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11479659/

Abstract

Analyzing vast textual data and summarizing key information from electronic health records imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown promise in natural language processing (NLP) tasks, their effectiveness on a diverse range of clinical summarization tasks remains unproven. Here we applied adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks: radiology reports, patient questions, progress notes and doctor-patient dialogue. Quantitative assessments with syntactic, semantic and conceptual NLP metrics reveal trade-offs between models and adaptation methods. A clinical reader study with 10 physicians evaluated summary completeness, correctness and conciseness; in most cases, summaries from our best-adapted LLMs were deemed either equivalent (45%) or superior (36%) compared with summaries from medical experts. The ensuing safety analysis highlights challenges faced by both LLMs and medical experts, as we connect errors to potential medical harm and categorize types of fabricated information. Our research provides evidence of LLMs outperforming medical experts in clinical text summarization across multiple tasks. This suggests that integrating LLMs into clinical workflows could alleviate documentation burden, allowing clinicians to focus more on patient care.
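The abstract refers to syntactic NLP metrics without naming them. As a rough illustration of what a syntactic overlap metric measures, the sketch below scores a candidate summary against a reference by their longest common subsequence of tokens, in the spirit of ROUGE-L. This is an assumption made for illustration, not the study's evaluation code; the function names `lcs_length` and `lcs_f1` and the whitespace tokenization are hypothetical choices.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the subsequence that ended before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Carry forward the best result seen so far.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_f1(candidate, reference):
    """F1 of LCS-based precision and recall over lowercase whitespace tokens.

    Illustrative only: real metric implementations use more careful
    tokenization and may weight precision and recall differently.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens that match
    recall = lcs / len(ref)      # share of reference tokens that are covered
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "no acute cardiopulmonary process"
    candidate = "no acute process"
    print(round(lcs_f1(candidate, reference), 3))
```

A score like this rewards word overlap only, so a summary that paraphrases the reference correctly can still score low. That limitation is one reason to pair overlap scores with semantic and conceptual metrics and with a reader study, as the abstract describes.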

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd07/11479659/ab4cbbd7435c/nihms-2023710-f0007.jpg
