A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation.

Author information

Asgari Elham, Montaña-Brown Nina, Dubois Magda, Khalil Saleh, Balloch Jasmine, Yeung Joshua Au, Pimenta Dominic

Affiliations

Tortus AI, London, UK.

Guy's and St Thomas NHS Trust, London, UK.

Publication information

NPJ Digit Med. 2025 May 13;8(1):274. doi: 10.1038/s41746-025-01670-7.

DOI:10.1038/s41746-025-01670-7

PMID:40360677

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12075489/

Abstract

Integrating large language models (LLMs) into healthcare can enhance workflow efficiency and patient care by automating tasks such as summarising consultations. However, the fidelity between LLM outputs and ground truth information is vital to prevent miscommunication that could lead to compromise in patient safety. We propose a framework comprising (1) an error taxonomy for classifying LLM outputs, (2) an experimental structure for iterative comparisons in our LLM document generation pipeline, (3) a clinical safety framework to evaluate the harms of errors, and (4) a graphical user interface, CREOLA, to facilitate these processes. Our clinical error metrics were derived from 18 experimental configurations involving LLMs for clinical note generation, consisting of 12,999 clinician-annotated sentences. We observed a 1.47% hallucination rate and a 3.45% omission rate. By refining prompts and workflows, we successfully reduced major errors below previously reported human note-taking rates, highlighting the framework's potential for safer clinical documentation.
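The hallucination and omission rates quoted in the abstract are proportions over clinician-annotated sentences. The following is a minimal sketch of that arithmetic only, assuming a hypothetical one-label-per-sentence annotation format. The label names, the `error_rate` helper, and the toy data are illustrative and are not taken from the paper or from CREOLA. The abstract does not state the denominator used for each rate, so the helper accepts one explicitly.

```python
from collections import Counter


def error_rate(labels, error_label, denominator=None):
    """Return the percentage of sentences annotated with error_label.

    labels: one annotation label per sentence (hypothetical format).
    denominator: optional sentence count to divide by; defaults to len(labels).
    """
    counts = Counter(labels)
    total = len(labels) if denominator is None else denominator
    if total <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * counts[error_label] / total


# Toy annotations invented for illustration; not data from the study.
toy_labels = ["ok"] * 95 + ["hallucination"] * 2 + ["omission"] * 3

hallucination_pct = error_rate(toy_labels, "hallucination")
omission_pct = error_rate(toy_labels, "omission")

print(f"hallucination rate: {hallucination_pct:.2f}%")
print(f"omission rate: {omission_pct:.2f}%")
```

Passing `denominator` separately matters when errors of different kinds are counted against different sentence pools, for example generated-note sentences versus source-transcript sentences.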

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c30f/12075489/a0eaa8b446d7/41746_2025_1670_Fig2_HTML.jpg
