Fraile Navarro David, Coiera Enrico, Hambly Thomas W, Triplett Zoe, Asif Nahyan, Susanto Anindya, Chowdhury Anamika, Azcoaga Lorenzo Amaya, Dras Mark, Berkovsky Shlomo
Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, Level 6, 75 Talavera Road, North Ryde, Sydney, NSW, 2113, Australia.
Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, Australia.
Sci Rep. 2025 Jan 7;15(1):1195. doi: 10.1038/s41598-024-84850-x.
We assessed the performance of large language models in summarizing clinical dialogues using computational metrics and human evaluations, comparing automatically generated summaries with human-produced ones. We conducted an exploratory evaluation of five language models: one general summarization model, one fine-tuned for general dialogues, two fine-tuned with anonymized clinical dialogues, and one Large Language Model (ChatGPT). These models were assessed using ROUGE and UniEval metrics, as well as expert human evaluation in which clinicians compared the generated summaries against a clinician-generated summary (gold standard). The fine-tuned transformer model scored the highest when evaluated with ROUGE, while ChatGPT scored the lowest overall. However, using UniEval, ChatGPT scored the highest across all the evaluated domains (coherence 0.957, consistency 0.7583, fluency 0.947, relevance 0.947, and overall score 0.9891). Similar results were obtained when the systems were evaluated by clinicians, with ChatGPT scoring the highest in four domains (coherence 0.573, consistency 0.908, fluency 0.96, and overall clinical use 0.862). Statistical analyses showed differences between ChatGPT and human summaries versus all other models. These exploratory results indicate that ChatGPT's performance in summarizing clinical dialogues approached the quality of human summaries. The study also found that ROUGE metrics may not be reliable for evaluating clinical summary generation, whereas UniEval correlated well with human ratings. Large language models may provide a successful path for automating clinical dialogue summarization, although privacy concerns and the restricted nature of health records remain challenges to their integration. Further evaluations using diverse clinical dialogues and multiple initialization seeds are needed to verify the reliability and generalizability of automatically generated summaries.
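As a minimal sketch of the computational evaluation described above (not the authors' actual pipeline), the ROUGE overlap between a model-generated summary and a clinician-written reference can be computed in Python with the rouge-score package; the summary strings below are illustrative placeholders, not data from the study.

# pip install rouge-score
from rouge_score import rouge_scorer

# Placeholder texts; the study compared model outputs against clinician-written gold-standard summaries.
reference = "Patient reports a two-week cough with no fever; plan: chest X-ray and review in one week."
candidate = "The patient has had a cough for two weeks without fever; a chest X-ray was ordered."

# ROUGE-1, ROUGE-2 and ROUGE-L, as commonly reported for summarization tasks.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, score in scores.items():
    print(f"{name}: precision={score.precision:.3f} recall={score.recall:.3f} f1={score.fmeasure:.3f}")

UniEval, in contrast, scores summaries along learned dimensions (coherence, consistency, fluency, relevance) using a pretrained evaluator model rather than n-gram overlap, which is why the two metric families can disagree as reported above.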