Goodman Katherine E, Robinson Matthew L, Shams Seyed M, Beccar-Varela Pilar, Fiawoo Suiyini, Kwon Nathan, Lee Jae Hyoung, Vorsteg Abigail H, Taneja Monica, Magder Laurence S, Sutherland Mark, Sorongon Scott, Tamma Pranita D, Morgan Daniel J, Resnik Philip, Harris Anthony D, Klein Eili Y
The University of Maryland School of Medicine, Baltimore.
The University of Maryland Institute for Health Computing, North Bethesda.
JAMA Netw Open. 2025 May 1;8(5):e2512032. doi: 10.1001/jamanetworkopen.2025.12032.
IMPORTANCE: An estimated half of all long-term care facility (LTCF) residents are colonized with antimicrobial-resistant organisms, and early identification of these patients on admission to acute care hospitals is a core strategy for preventing intrahospital spread. However, because LTCF exposure is not reliably captured in structured electronic health record data, LTCF-exposed patients routinely go undetected. Large language models (LLMs) offer a promising, but untested, opportunity for extracting this information from patient admission histories.
OBJECTIVE: To evaluate the performance of an LLM against human review for identifying recent LTCF exposure from identifiable patient admission histories.
DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional, multicenter study used the history and physical (H&P) notes from unique, randomly sampled adult admissions occurring between January 1, 2016, and December 31, 2021, at 13 hospitals in the University of Maryland Medical System (UMMS) and the Johns Hopkins (Hopkins) health care system to compare the performance of an LLM (GPT-4-Turbo) using zero-shot prompting against humans in identifying patients with recent LTCF exposure. LLM analyses were conducted from August to September 2024.
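The abstract does not reproduce the authors' prompt. As an illustration only, a zero-shot classification request in this spirit might be assembled and its structured reply parsed as below; the prompt wording, JSON schema, and function names are assumptions, not the study's actual materials:

```python
import json

# Hypothetical zero-shot prompt modeled on the task described in the study;
# the actual prompt text used by the authors is not given in the abstract.
PROMPT_TEMPLATE = """You are reviewing a hospital admission history and physical (H&P) note.
Question: does the note document that the patient resided in a long-term care
facility (LTCF) within the past 12 months?
Answer with JSON only: {{"ltcf_exposure": true or false, "rationale": "...", "quote": "..."}}

Note text:
{note_text}"""


def build_prompt(note_text: str) -> str:
    """Fill the zero-shot template with a single H&P note (no labeled examples)."""
    return PROMPT_TEMPLATE.format(note_text=note_text)


def parse_reply(raw_reply: str) -> dict:
    """Extract the label, rationale, and supporting quote from the model's JSON
    reply, mirroring the study's requirement that the LLM justify each
    classification with note text."""
    data = json.loads(raw_reply)
    return {
        "ltcf_exposure": bool(data["ltcf_exposure"]),
        "rationale": data["rationale"],
        "quote": data["quote"],
    }
```

The filled prompt would be sent to the model (GPT-4-Turbo in the study) with no in-context examples; parsing the reply into label, rationale, and quote is what makes the classifications auditable, as the authors did for 117 rationales.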
EXPOSURES: Recent (≤12 months) LTCF exposure documented in the H&P note, as adjudicated by (1) humans and (2) an LLM.
MAIN OUTCOMES AND MEASURES: LLM sensitivity and specificity with Clopper-Pearson 95% CIs. Secondary outcomes were note review time and cost. The LLM was also prompted to provide a rationale and supporting note text for each classification.
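For illustration, sensitivity, specificity, and exact (Clopper-Pearson) 95% CIs can be computed from 2×2 confusion counts with only the standard library; this is a sketch under assumed helper names, not the authors' analysis code:

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) binomial CI for k successes in n trials, found
    by bisection on the binomial CDF to stay stdlib-only (the usual shortcut
    uses beta-distribution quantiles)."""
    def bisect(is_below):  # converges to the p where is_below flips
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if is_below(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: largest p with P(X >= k | p) < alpha/2; 0 when k == 0.
    lower = 0.0 if k == 0 else bisect(lambda p: 1 - binom_cdf(k - 1, n, p) < alpha / 2)
    # Upper bound: largest p with P(X <= k | p) >= alpha/2; 1 when k == n.
    upper = 1.0 if k == n else bisect(lambda p: binom_cdf(k, n, p) >= alpha / 2)
    return lower, upper


def sensitivity_specificity(tp: int, fp: int, fn: int, tn: int):
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), each with 95% CI."""
    sens, sens_ci = tp / (tp + fn), clopper_pearson(tp, tp + fn)
    spec, spec_ci = tn / (tn + fp), clopper_pearson(tn, tn + fp)
    return (sens, sens_ci), (spec, spec_ci)
```

Against a human reference standard, the LLM's classifications fill the 2×2 table (human label as truth), and each proportion gets an exact interval; the exact method matters here because the LTCF-positive class is small, where normal-approximation intervals misbehave.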
RESULTS: The study included 359 601 eligible adult admissions, from which 2087 randomly sampled H&P notes were manually reviewed for LTCF residence at UMMS (1020 individuals; median [IQR] age, 58 [41-71] years; 493 [48%] male) and Hopkins (1067 individuals; median [IQR] age, 58 [48-67] years; 561 [53%] male). Compared with human review, the LLM achieved a sensitivity of 97% (95% CI, 91%-100%) and a specificity of 98% (95% CI, 97%-99%) at UMMS, and a sensitivity of 96% (95% CI, 86%-100%) and a specificity of 93% (95% CI, 92%-95%) at Hopkins; specificity at Hopkins improved to 96% (95% CI, 95%-97%) with prompt revision. Of 117 manually reviewed LLM rationales, all were factually correct and quoted note text accurately, and some demonstrated inferential logic and external knowledge. The LLM identified 37 (1.8%) human errors. Human review took a mean of 2.5 minutes and cost $0.63 to $0.83 per note, vs a mean of 4 to 6 seconds and $0.03 per note for LLM review.
CONCLUSIONS AND RELEVANCE: In this 13-hospital study of 2087 adult admissions, an LLM accurately identified LTCF residence from H&P notes and was more than 25 times faster and more than 20 times cheaper than human review.