Expert evaluation of large language models for clinical dialogue summarization.

Authors

Fraile Navarro David, Coiera Enrico, Hambly Thomas W, Triplett Zoe, Asif Nahyan, Susanto Anindya, Chowdhury Anamika, Azcoaga Lorenzo Amaya, Dras Mark, Berkovsky Shlomo

Affiliations

Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, Level 6, 75 Talavera Road, North Ryde, Sydney, NSW, 2113, Australia.

Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, Australia.

Publication

Sci Rep. 2025 Jan 7;15(1):1195. doi: 10.1038/s41598-024-84850-x.

DOI:10.1038/s41598-024-84850-x
PMID:39774141
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11707028/
Abstract

We assessed the performance of large language models in summarizing clinical dialogues using computational metrics and human evaluations, comparing automatically generated summaries against human-produced ones. We conducted an exploratory evaluation of five language models: one general summarization model, one fine-tuned for general dialogues, two fine-tuned with anonymized clinical dialogues, and one large language model (ChatGPT). These models were assessed using ROUGE and UniEval metrics, and expert human evaluation was performed by clinicians comparing the generated summaries against a clinician-generated summary (gold standard). The fine-tuned transformer model scored highest when evaluated with ROUGE, while ChatGPT scored lowest overall. However, using UniEval, ChatGPT scored highest across all evaluated domains (coherence 0.957, consistency 0.7583, fluency 0.947, relevance 0.947, and overall score 0.9891). Similar results were obtained when the systems were evaluated by clinicians, with ChatGPT scoring highest in four domains (coherence 0.573, consistency 0.908, fluency 0.96, and overall clinical use 0.862). Statistical analyses showed differences between ChatGPT and human summaries vs. all other models. These exploratory results indicate that ChatGPT's performance in summarizing clinical dialogues approached the quality of human summaries. The study also found that ROUGE metrics may not be reliable for evaluating clinical summary generation, whereas UniEval correlated well with human ratings. Large language models may provide a successful path toward automating clinical dialogue summarization, though privacy concerns and the restricted nature of health records remain challenges for their integration. Further evaluations using diverse clinical dialogues and multiple initialization seeds are needed to verify the reliability and generalizability of automatically generated summaries.
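To make the ROUGE metric discussed above concrete, here is a minimal plain-Python sketch of ROUGE-N (n-gram overlap with clipped counts, reporting precision, recall, and F1). The example sentences are invented for illustration and are not taken from the study's data; published implementations such as the `rouge-score` package add stemming and other normalization not shown here.

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Compute ROUGE-N precision, recall, and F1 via clipped n-gram overlap."""
    def ngrams(text: str, n: int) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # multiset intersection = clipped counts
    p = overlap / max(sum(cand.values()), 1)  # precision: overlap / candidate n-grams
    r = overlap / max(sum(ref.values()), 1)   # recall: overlap / reference n-grams
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f1}

# Invented example: model summary vs. clinician reference
scores = rouge_n("the patient reports chest pain",
                 "patient reports severe chest pain")
```

As the abstract notes, such surface n-gram overlap can miss clinical adequacy, which is why the study complements ROUGE with UniEval and expert human ratings.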


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b0a8/11707028/9d311026cd58/41598_2024_84850_Fig1_HTML.jpg

Similar articles

1
Expert evaluation of large language models for clinical dialogue summarization.
Sci Rep. 2025 Jan 7;15(1):1195. doi: 10.1038/s41598-024-84850-x.
2
Exploring the potential of ChatGPT in medical dialogue summarization: a study on consistency with human preferences.
BMC Med Inform Decis Mak. 2024 Mar 14;24(1):75. doi: 10.1186/s12911-024-02481-8.
3
Evaluation of large language models performance against humans for summarizing MRI knee radiology reports: A feasibility study.
Int J Med Inform. 2024 Jul;187:105443. doi: 10.1016/j.ijmedinf.2024.105443. Epub 2024 Apr 4.
4
A dataset and benchmark for hospital course summarization with adapted large language models.
J Am Med Inform Assoc. 2025 Mar 1;32(3):470-479. doi: 10.1093/jamia/ocae312.
5
Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: Benchmark Study.
JMIR Ment Health. 2024 Jul 23;11:e57306. doi: 10.2196/57306.
6
Large language models vs human for classifying clinical documents.
Int J Med Inform. 2025 Mar;195:105800. doi: 10.1016/j.ijmedinf.2025.105800. Epub 2025 Jan 21.
7
Text summarization with ChatGPT for drug labeling documents.
Drug Discov Today. 2024 Jun;29(6):104018. doi: 10.1016/j.drudis.2024.104018. Epub 2024 May 7.
8
Application of Large Language Models in Medical Training Evaluation-Using ChatGPT as a Standardized Patient: Multimetric Assessment.
J Med Internet Res. 2025 Jan 1;27:e59435. doi: 10.2196/59435.
9
Virtual Patients Using Large Language Models: Scalable, Contextualized Simulation of Clinician-Patient Dialogue With Feedback.
J Med Internet Res. 2025 Apr 4;27:e68486. doi: 10.2196/68486.
10
Performance and exploration of ChatGPT in medical examination, records and education in Chinese: Pave the way for medical AI.
Int J Med Inform. 2023 Sep;177:105173. doi: 10.1016/j.ijmedinf.2023.105173. Epub 2023 Aug 4.

Cited by

1
Artificial intelligence in pediatric healthcare: current applications, potential, and implementation considerations.
Clin Exp Pediatr. 2025 Sep;68(9):641-651. doi: 10.3345/cep.2025.00962. Epub 2025 Jun 25.
2
Comparing artificial intelligence- vs clinician-authored summaries of simulated primary care electronic health records.
JAMIA Open. 2025 Jul 30;8(4):ooaf082. doi: 10.1093/jamiaopen/ooaf082. eCollection 2025 Aug.
3
Scalable evaluation framework for retrieval augmented generation in tobacco research using large language models.
Sci Rep. 2025 Jul 2;15(1):22760. doi: 10.1038/s41598-025-05726-2.
4
Verifiable Summarization of Electronic Health Records Using Large Language Models to Support Chart Review.
medRxiv. 2025 Jun 3:2025.06.02.25328807. doi: 10.1101/2025.06.02.25328807.

References

1
Evaluation of large language models performance against humans for summarizing MRI knee radiology reports: A feasibility study.
Int J Med Inform. 2024 Jul;187:105443. doi: 10.1016/j.ijmedinf.2024.105443. Epub 2024 Apr 4.
2
Adapted large language models can outperform medical experts in clinical text summarization.
Nat Med. 2024 Apr;30(4):1134-1142. doi: 10.1038/s41591-024-02855-5. Epub 2024 Feb 27.
3
Implementing machine learning in medicine.
CMAJ. 2021 Aug 30;193(34):E1351-E1357. doi: 10.1503/cmaj.202434. Epub 2021 Aug 29.
4
The Effect of Electronic Health Record Burden on Pediatricians' Work-Life Balance and Career Satisfaction.
Appl Clin Inform. 2021 May;12(3):697-707. doi: 10.1055/s-0041-1732402. Epub 2021 Aug 2.
5
Exploring the relationship between electronic health records and provider burnout: A systematic review.
J Am Med Inform Assoc. 2021 Apr 23;28(5):1009-1021. doi: 10.1093/jamia/ocab009.
6
Building the evidence-base to reduce electronic health record-related clinician burden.
J Am Med Inform Assoc. 2021 Apr 23;28(5):1057-1061. doi: 10.1093/jamia/ocaa238.
7
Identifying relevant information in medical conversations to summarize a clinician-patient encounter.
Health Informatics J. 2020 Dec;26(4):2906-2914. doi: 10.1177/1460458220951719. Epub 2020 Aug 29.
8
A network model of activities in primary care consultations.
J Am Med Inform Assoc. 2019 Oct 1;26(10):1074-1082. doi: 10.1093/jamia/ocz046.
9
The digital scribe.
NPJ Digit Med. 2018 Oct 16;1:58. doi: 10.1038/s41746-018-0066-9. eCollection 2018.
10
[Reasons for General Practitioner Shortage – a Comparison Between France and Switzerland].
Praxis (Bern 1994). 2016 May 25;105(11):619-36. doi: 10.1024/1661-8157/a002362.