Developing and Evaluating Large Language Model-Generated Emergency Medicine Handoff Notes.

Author information

Hartman Vince, Zhang Xinyuan, Poddar Ritika, McCarty Matthew, Fortenko Alexander, Sholle Evan, Sharma Rahul, Campion Thomas, Steel Peter A D

Affiliations

Abstractive Health, New York, New York.

Department of Emergency Medicine, NewYork-Presbyterian/Weill Cornell Medicine, New York.

Publication information

JAMA Netw Open. 2024 Dec 2;7(12):e2448723. doi: 10.1001/jamanetworkopen.2024.48723.

DOI:10.1001/jamanetworkopen.2024.48723

PMID:39625719

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11615705/

Abstract

IMPORTANCE

An emergency medicine (EM) handoff note generated by a large language model (LLM) has the potential to reduce physician documentation burden without compromising the safety of EM-to-inpatient (IP) handoffs.

OBJECTIVE

To develop LLM-generated EM-to-IP handoff notes and evaluate their accuracy and safety compared with physician-written notes.

DESIGN, SETTING, AND PARTICIPANTS

This cohort study used EM patient medical records with acute hospital admissions that occurred in 2023 at NewYork-Presbyterian/Weill Cornell Medical Center. A customized clinical LLM pipeline was trained, tested, and evaluated to generate templated EM-to-IP handoff notes. LLM-generated handoff notes were compared with physician-written notes using both conventional automated methods (ie, recall-oriented understudy for gisting evaluation [ROUGE], bidirectional encoder representations from transformers score [BERTScore], and source chunking approach for large-scale inconsistency evaluation [SCALE]) and a novel patient safety-focused framework. Data were analyzed from October 2023 to March 2024.

EXPOSURE

LLM-generated EM handoff notes.

MAIN OUTCOMES AND MEASURES

LLM-generated handoff notes were evaluated for (1) lexical similarity with respect to physician-written notes using ROUGE and BERTScore; (2) fidelity with respect to source notes using SCALE; and (3) readability, completeness, curation, correctness, usefulness, and implications for patient safety using a novel framework.
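As a rough illustration of what a lexical-similarity score such as ROUGE measures, the sketch below computes n-gram overlap between a candidate note and a reference note. It is a minimal sketch only: the abstract does not state which ROUGE variant, tokenizer, or preprocessing the authors used, so the whitespace tokenization, the choice of ROUGE-N, and the function names `ngrams` and `rouge_n` are assumptions, and the two note fragments are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Compute ROUGE-N precision, recall, and F1 for two strings.

    Assumption: lowercase whitespace tokenization, no stemming.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical note fragments, invented for illustration only.
physician_note = "patient admitted for chest pain troponin negative"
llm_note = "patient admitted with chest pain troponin negative ecg normal"

scores = rouge_n(llm_note, physician_note, n=1)
print(scores)
```

Because the score rewards shared wording rather than clinical correctness, a high value indicates overlap with the comparison text and says nothing on its own about safety, which is why the study pairs it with a clinician-rated framework.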

RESULTS

In this study of 1600 EM patient records (832 [52%] female and mean [SD] age of 59.9 [18.9] years), LLM-generated handoff notes, compared with physician-written ones, had higher ROUGE (0.322 vs 0.088), BERTScore (0.859 vs 0.796), and SCALE scores (0.691 vs 0.456), indicating the LLM-generated summaries exhibited greater similarity and more detail. As reviewed by 3 board-certified EM physicians, a subsample of 50 LLM-generated summaries had a mean (SD) usefulness score of 4.04 (0.86) out of 5 (compared with 4.36 [0.71] for physician-written) and mean (SD) patient safety scores of 4.06 (0.86) out of 5 (compared with 4.50 [0.56] for physician-written). None of the LLM-generated summaries were classified as a critical patient safety risk.
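The size of each gap can be read off the reported values directly. The short sketch below subtracts the physician-written score from the LLM-generated score for each measure; the numbers are copied from the paragraph above, while the dictionary layout and variable names are illustrative assumptions. These are plain differences of reported means, not a statistical test.

```python
# Values copied from the Results paragraph: (LLM-generated, physician-written).
reported = {
    "ROUGE": (0.322, 0.088),
    "BERTScore": (0.859, 0.796),
    "SCALE": (0.691, 0.456),
    "usefulness (1-5)": (4.04, 4.36),
    "patient safety (1-5)": (4.06, 4.50),
}

# Difference = LLM-generated minus physician-written; positive favors the LLM.
differences = {
    name: round(llm - physician, 3)
    for name, (llm, physician) in reported.items()
}

for name, diff in differences.items():
    print(f"{name}: {diff:+.3f}")
```

The automated metrics all favor the LLM-generated notes, while both clinician-rated scores favor the physician-written notes by less than half a point on the 5-point scale.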

CONCLUSIONS AND RELEVANCE

In this cohort study of 1600 EM patient medical records, LLM-generated EM-to-IP handoff notes were rated superior to physician-written summaries by conventional automated evaluation methods, but marginally inferior in usefulness and safety by a novel evaluation framework. This study suggests the importance of a physician-in-the-loop implementation design for this model and demonstrates an effective strategy to measure the preimplementation patient safety of LLMs.
