Hartman Vince, Zhang Xinyuan, Poddar Ritika, McCarty Matthew, Fortenko Alexander, Sholle Evan, Sharma Rahul, Campion Thomas, Steel Peter A D
Abstractive Health, New York, New York.
Department of Emergency Medicine, NewYork-Presbyterian/Weill Cornell Medicine, New York.
JAMA Netw Open. 2024 Dec 2;7(12):e2448723. doi: 10.1001/jamanetworkopen.2024.48723.
IMPORTANCE: An emergency medicine (EM) handoff note generated by a large language model (LLM) has the potential to reduce physician documentation burden without compromising the safety of EM-to-inpatient (IP) handoffs.
OBJECTIVE: To develop LLM-generated EM-to-IP handoff notes and evaluate their accuracy and safety compared with physician-written notes.
DESIGN, SETTING, AND PARTICIPANTS: This cohort study used EM patient medical records with acute hospital admissions that occurred in 2023 at NewYork-Presbyterian/Weill Cornell Medical Center. A customized clinical LLM pipeline was trained, tested, and evaluated to generate templated EM-to-IP handoff notes. LLM-generated handoff notes were compared with physician-written notes using both conventional automated methods (ie, recall-oriented understudy for gisting evaluation [ROUGE], bidirectional encoder representations from transformers score [BERTScore], and source chunking approach for large-scale inconsistency evaluation [SCALE]) and a novel patient safety-focused framework. Data were analyzed from October 2023 to March 2024.
EXPOSURE: LLM-generated EM handoff notes.
MAIN OUTCOMES AND MEASURES: LLM-generated handoff notes were evaluated for (1) lexical similarity with respect to physician-written notes using ROUGE and BERTScore; (2) fidelity with respect to source notes using SCALE; and (3) readability, completeness, curation, correctness, usefulness, and implications for patient safety using a novel framework.
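As a rough illustration of the lexical-similarity measure named above, the following is a minimal sketch of ROUGE-N computed from clipped n-gram overlap. It is not the study's evaluation code: the abstract does not state which ROUGE variant or tokenizer was used, so the whitespace tokenization, the `rouge_n` and `ngrams` helper names, and the example sentences are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall, and F1 using lowercased whitespace tokens.

    Hypothetical helper for illustration; real toolkits add stemming and
    more careful tokenization.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example sentences, not taken from the study data.
reference_note = "patient admitted for chest pain and started on aspirin"
generated_note = "patient admitted with chest pain and given aspirin"
scores = rouge_n(generated_note, reference_note, n=1)
print(scores)
```

Scores of this kind reward shared wording only, which is one reason the study pairs them with a fidelity measure (SCALE) and physician review.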
RESULTS: In this study of 1600 EM patient records (832 [52%] female and mean [SD] age of 59.9 [18.9] years), LLM-generated handoff notes, compared with physician-written ones, had higher ROUGE (0.322 vs 0.088), BERTScore (0.859 vs 0.796), and SCALE scores (0.691 vs 0.456), indicating the LLM-generated summaries exhibited greater similarity and more detail. As reviewed by 3 board-certified EM physicians, a subsample of 50 LLM-generated summaries had a mean (SD) usefulness score of 4.04 (0.86) out of 5 (compared with 4.36 [0.71] for physician-written) and mean (SD) patient safety scores of 4.06 (0.86) out of 5 (compared with 4.50 [0.56] for physician-written). None of the LLM-generated summaries were classified as a critical patient safety risk.
CONCLUSIONS AND RELEVANCE: In this cohort study of 1600 EM patient medical records, LLM-generated EM-to-IP handoff notes were rated superior to physician-written summaries by conventional automated evaluation methods, but marginally inferior in usefulness and safety by a novel evaluation framework. This study suggests the importance of a physician-in-the-loop implementation design for this model and demonstrates an effective strategy to measure the preimplementation patient safety of LLMs.