
A dataset and benchmark for hospital course summarization with adapted large language models.

Authors

Aali Asad, Van Veen Dave, Arefeen Yamin Ishraq, Hom Jason, Bluethgen Christian, Reis Eduardo Pontes, Gatidis Sergios, Clifford Namuun, Daws Joseph, Tehrani Arash S, Kim Jangwon, Chaudhari Akshay S

Affiliations

Department of Radiology, Stanford University, Stanford, CA 94304, United States.

Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, United States.

Publication

J Am Med Inform Assoc. 2025 Mar 1;32(3):470-479. doi: 10.1093/jamia/ocae312.

Abstract

OBJECTIVE

Brief hospital course (BHC) summaries are clinical documents that summarize a patient's hospital stay. While large language models (LLMs) demonstrate remarkable capabilities in automating real-world tasks, their suitability for healthcare applications such as synthesizing BHCs from clinical notes has not been shown. We introduce a novel preprocessed dataset, MIMIC-IV-BHC, comprising paired clinical notes and BHCs for adapting LLMs to BHC synthesis. Furthermore, we introduce a benchmark of the summarization performance of 2 general-purpose LLMs and 3 healthcare-adapted LLMs.

MATERIALS AND METHODS

Using clinical notes as input, we apply prompting-based (using in-context learning) and fine-tuning-based adaptation strategies to 3 open-source LLMs (Clinical-T5-Large, Llama2-13B, and FLAN-UL2) and 2 proprietary LLMs (Generative Pre-trained Transformer [GPT]-3.5 and GPT-4). We evaluate these LLMs across multiple context-length inputs using natural language similarity metrics. We further conduct a clinical study with 5 clinicians, comparing clinician-written and LLM-generated BHCs across 30 samples, focusing on their potential to enhance clinical decision-making through improved summary quality. We compare reader preferences for the original and LLM-generated summaries using Wilcoxon signed-rank tests. We also request optional qualitative feedback from clinicians to gain deeper insight into their preferences, and we report the frequency of common themes arising from these comments.
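The paired reader-preference comparison above relies on a Wilcoxon signed-rank test. As a rough illustration only (not the paper's analysis code), here is a minimal pure-Python sketch of that test with a normal-approximation p-value, applied to hypothetical 1-5 preference scores for paired summaries:

```python
import math

def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test (sketch). Zero differences are
    dropped; tied |differences| receive average ranks; the two-sided
    p-value uses the normal approximation (reasonable for n >= ~20)."""
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    w_minus = sum(r for r, di in zip(ranks, d) if di < 0)
    W = min(w_plus, w_minus)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (W - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return W, p

# Hypothetical 1-5 preference scores for 10 paired summaries.
gpt4_scores = [5, 4, 5, 4, 5, 4, 5, 4, 5, 4]
original_scores = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
W, p = wilcoxon_signed_rank(gpt4_scores, original_scores)
```

In practice one would use `scipy.stats.wilcoxon`, which also offers an exact p-value for small samples; the sketch above only shows the mechanics of the statistic.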

RESULTS

The fine-tuned Llama2-13B outperforms the other domain-adapted models on the quantitative metrics Bilingual Evaluation Understudy (BLEU) and Bidirectional Encoder Representations from Transformers (BERT)-Score. GPT-4 with in-context learning is more robust to increasing context lengths of clinical note inputs than fine-tuned Llama2-13B. Despite comparable quantitative metrics, the reader study reveals a significant preference for summaries generated by GPT-4 with in-context learning over both fine-tuned Llama2-13B summaries and the original summaries (P<.001), highlighting the need for qualitative clinical evaluation.
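To make the BLEU metric above concrete: BLEU combines modified n-gram precisions with a brevity penalty. A minimal single-reference, sentence-level sketch (uniform weights, add-one smoothing; not the exact implementation used in the benchmark) could look like:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Minimal sentence-level BLEU sketch: geometric mean of clipped
    1..max_n-gram precisions (add-one smoothed) times a brevity penalty.
    Tokens are whitespace-split words; single reference only."""
    cand = candidate.split()
    ref = reference.split()
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = ngrams(cand, n)
        r_ngrams = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        log_precisions += math.log((overlap + 1) / (total + 1))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_precisions / max_n)
```

Production evaluations typically use `sacrebleu` (corpus-level, standardized tokenization) rather than a hand-rolled score; BERTScore, by contrast, compares contextual embeddings instead of surface n-grams, which is why the two metrics can disagree.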

DISCUSSION AND CONCLUSION

We release a foundational clinically relevant dataset, MIMIC-IV-BHC, and present an open-source benchmark of LLM performance in BHC synthesis from clinical notes. We observe high-quality summarization performance for both in-context proprietary and fine-tuned open-source LLMs using both quantitative metrics and a qualitative clinical reader study. Our research integrates elements of the data assimilation pipeline: our methods draw on (1) clinical data sources, (2) data translation, and (3) knowledge creation, while our evaluation strategy paves the way for (4) deployment.


