Li Yiming, Li Fang, Hong Na, Li Manqi, Roberts Kirk, Cui Licong, Tao Cui, Xu Hua
McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA.
Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL 32224, USA.
J Biomed Inform. 2025 Aug;168:104867. doi: 10.1016/j.jbi.2025.104867. Epub 2025 Jun 20.
Generating discharge summaries is a crucial yet time-consuming task in clinical practice, essential for conveying pertinent patient information and facilitating continuity of care. Recent advancements in large language models (LLMs) have significantly enhanced their capability to understand and summarize complex medical texts. This research aims to explore how LLMs can alleviate the burden of manual summarization, improve workflow efficiency, and support informed decision-making in healthcare settings.
Clinical notes from a cohort of 1,099 lung cancer patients were utilized, with a subset of 50 patients used for testing and 102 patients used for model fine-tuning. This study evaluates the performance of multiple LLMs, including GPT-3.5, GPT-4, GPT-4o, and LLaMA 3 8B, in generating discharge summaries. Evaluation metrics included token-level analysis (BLEU, ROUGE-1, ROUGE-2, ROUGE-L), semantic similarity scores, and manual evaluation of clinical relevance, factual faithfulness, and completeness. An iterative method was further tested on LLaMA 3 8B using clinical notes of varying lengths to examine the stability of its performance.
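To illustrate the token-level metrics named above, the following is a minimal sketch of ROUGE-1 F1, the unigram-overlap score between a reference discharge summary and a model-generated one. This is not the paper's evaluation code; whitespace tokenization and lowercasing are simplifying assumptions here (published evaluations typically use a library implementation with stemming).

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: unigram overlap between a reference summary and a
    generated summary. Whitespace tokenization and lowercasing are
    simplifications for illustration."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    if not ref_counts or not cand_counts:
        return 0.0
    # Clipped overlap: each candidate token is credited at most as many
    # times as it occurs in the reference.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / sum(ref_counts.values())
    precision = overlap / sum(cand_counts.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 and ROUGE-L follow the same precision/recall/F1 pattern, computed over bigrams and the longest common subsequence, respectively, which is why the three are commonly reported together as in this study.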
The study found notable variations in summarization capabilities among LLMs. GPT-4o and fine-tuned LLaMA 3 demonstrated superior token-level evaluation metrics, while manual evaluation further revealed that GPT-4 achieved the highest scores in relevance (4.95 ± 0.22) and factual faithfulness (4.40 ± 0.50), whereas GPT-4o performed best in completeness (4.55 ± 0.69); both models showed comparable overall quality. Semantic similarity scores indicated GPT-4o and LLaMA 3 as leading models in capturing the underlying meaning and context of clinical narratives.
This study contributes insights into the efficacy of LLMs for generating discharge summaries, highlighting the potential of automated summarization tools to enhance documentation precision and efficiency, ultimately improving patient care and operational capability in healthcare settings.