Aali Asad, Van Veen Dave, Arefeen Yamin Ishraq, Hom Jason, Bluethgen Christian, Reis Eduardo Pontes, Gatidis Sergios, Clifford Namuun, Daws Joseph, Tehrani Arash S, Kim Jangwon, Chaudhari Akshay S
Department of Radiology, Stanford University, Stanford, CA 94304, United States.
Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, United States.
J Am Med Inform Assoc. 2025 Mar 1;32(3):470-479. doi: 10.1093/jamia/ocae312.
Brief hospital course (BHC) summaries are clinical documents that summarize a patient's hospital stay. While large language models (LLMs) demonstrate remarkable capabilities in automating real-world tasks, their capabilities for healthcare applications, such as synthesizing BHCs from clinical notes, have not yet been demonstrated. We introduce a novel preprocessed dataset, MIMIC-IV-BHC, comprising paired clinical notes and BHCs for adapting LLMs to BHC synthesis. Furthermore, we introduce a benchmark of the summarization performance of 2 general-purpose LLMs and 3 healthcare-adapted LLMs.
Using clinical notes as input, we apply prompting-based (using in-context learning) and fine-tuning-based adaptation strategies to 3 open-source LLMs (Clinical-T5-Large, Llama2-13B, and FLAN-UL2) and 2 proprietary LLMs (Generative Pre-trained Transformer [GPT]-3.5 and GPT-4). We evaluate these LLMs across multiple context-length inputs using natural language similarity metrics. We further conduct a clinical study with 5 clinicians, comparing clinician-written and LLM-generated BHCs across 30 samples, focusing on their potential to enhance clinical decision-making through improved summary quality. We compare reader preferences for the original and LLM-generated summaries using Wilcoxon signed-rank tests. We further solicit optional qualitative feedback from clinicians to gain deeper insight into their preferences, and we report the frequency of common themes arising from these comments.
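The paired-preference analysis above can be illustrated with a minimal, dependency-free sketch of the Wilcoxon signed-rank statistic. All scores below are illustrative placeholders, not data from the study, and the zero-difference handling shown is one common convention ("wilcox"); the study's exact test configuration is not specified here.

```python
# Minimal sketch: Wilcoxon signed-rank statistic for paired reader-preference
# scores (original vs LLM-generated summaries). Illustrative data only.

def wilcoxon_signed_rank(x, y):
    """Return the Wilcoxon signed-rank statistic W = min(W+, W-).

    Ties in |difference| receive average ranks; zero differences are
    discarded (the 'wilcox' zero-handling convention).
    """
    diffs = [b - a for a, b in zip(x, y) if b != a]
    abs_sorted = sorted(abs(d) for d in diffs)
    # Assign each distinct |difference| the average of its tied ranks.
    ranks = {}
    i = 0
    while i < len(abs_sorted):
        j = i
        while j < len(abs_sorted) and abs_sorted[j] == abs_sorted[i]:
            j += 1
        ranks[abs_sorted[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    w_plus = sum(ranks[abs(d)] for d in diffs if d > 0)
    w_minus = sum(ranks[abs(d)] for d in diffs if d < 0)
    return min(w_plus, w_minus)

# Illustrative paired Likert-style scores for 10 samples from one reader
# (the study used 5 clinicians and 30 samples).
original = [3, 2, 4, 3, 2, 3, 4, 2, 3, 3]
generated = [4, 4, 5, 4, 3, 4, 5, 4, 4, 5]
print(wilcoxon_signed_rank(original, generated))  # prints 0: every difference favors the generated summaries
```

In practice a library routine such as `scipy.stats.wilcoxon` would also return the p-value; the sketch only computes the test statistic.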
The fine-tuned Llama2-13B outperforms the other domain-adapted models on the quantitative evaluation metrics Bilingual Evaluation Understudy (BLEU) and Bidirectional Encoder Representations from Transformers (BERT)-Score. GPT-4 with in-context learning is more robust to increasing context lengths of clinical note inputs than fine-tuned Llama2-13B. Despite comparable quantitative metrics, the reader study reveals a significant preference for summaries generated by GPT-4 with in-context learning over both the fine-tuned Llama2-13B summaries and the original summaries (P<.001), highlighting the need for qualitative clinical evaluation.
We release a foundational clinically relevant dataset, MIMIC-IV-BHC, and present an open-source benchmark of LLM performance in BHC synthesis from clinical notes. We observe high-quality summarization performance for both the in-context-learning proprietary LLMs and the fine-tuned open-source LLMs, using both quantitative metrics and a qualitative clinical reader study. Our research integrates key elements of the data assimilation pipeline: our methods use (1) integration of clinical data sources, (2) data translation, and (3) knowledge creation, while our evaluation strategy paves the way for (4) deployment.