Williams Christopher Y K, Subramanian Charumathi Raghu, Ali Syed Salman, Apolinario Michael, Askin Elisabeth, Barish Peter, Cheng Monica, Deardorff W James, Donthi Nisha, Ganeshan Smitha, Huang Owen, Kantor Molly A, Lai Andrew R, Manchanda Ashley, Moore Kendra A, Muniyappa Anoop N, Nair Geethu, Patel Prashant P, Santhosh Lekshmi, Schneider Susan, Torres Shawn, Yukawa Michi, Hubbard Colin C, Rosner Benjamin I
Bakar Computational Health Sciences Institute, University of California San Francisco.
Division of Hospital Medicine, University of California San Francisco.
JAMA Intern Med. 2025 May 5. doi: 10.1001/jamainternmed.2025.0821.
IMPORTANCE: High-quality discharge summaries are associated with improved patient outcomes but contribute to clinical documentation burden. Large language models (LLMs) provide an opportunity to support physicians by drafting discharge summary narratives.
OBJECTIVE: To determine whether LLM-generated discharge summary narratives are comparable in quality and safety to those written by physicians.
DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional study conducted at the University of California, San Francisco included 100 randomly selected inpatient hospital medicine encounters of 3 to 6 days' duration between 2019 and 2022. The analysis took place in July 2024.
EXPOSURES: A blinded evaluation of physician- and LLM-generated narratives was performed in duplicate by 22 attending physician reviewers.
MAIN OUTCOMES AND MEASURES: Narratives were reviewed for overall quality, reviewer preference, comprehensiveness, concision, coherence, and 3 error types (inaccuracies, omissions, and hallucinations). Each individual error and each narrative overall was assigned a potential harmfulness score ranging from 0 to 7 on an adapted Agency for Healthcare Research and Quality (AHRQ) scale.
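For illustration only, the sketch below shows one plausible way a single reviewer's assessment of one narrative could be encoded to capture the measures described above. The class and field names are assumptions for this sketch, not the study's actual data collection instrument.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record structure mirroring the abstract's outcome measures:
# four 1-5 Likert ratings, a list of identified errors by type, and an
# adapted AHRQ harmfulness score (0-7) per error and per narrative.

@dataclass
class ReviewedError:
    error_type: str          # "inaccuracy", "omission", or "hallucination"
    harm_score: int          # 0-7 on the adapted AHRQ scale

@dataclass
class NarrativeReview:
    source: str              # "llm" or "physician"
    overall_quality: int     # 1-5 Likert
    comprehensiveness: int   # 1-5 Likert
    concision: int           # 1-5 Likert
    coherence: int           # 1-5 Likert
    errors: List[ReviewedError] = field(default_factory=list)
    overall_harm_score: int = 0   # 0-7 for the narrative as a whole
```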
RESULTS: Across 100 encounters, LLM- and physician-generated narratives were comparable in overall quality on a Likert scale ranging from 1 to 5 (higher scores indicate higher quality; mean [SD] score, 3.67 [0.49] vs 3.77 [0.57]; P = .21) and reviewer preference (χ2 = 5.2; P = .27). LLM-generated narratives were more concise (mean [SD] score, 4.01 [0.37] vs 3.70 [0.59]; P < .001) and more coherent (mean [SD] score, 4.16 [0.39] vs 4.01 [0.53]; P = .02) than their physician-generated counterparts, but less comprehensive (mean [SD] score, 3.72 [0.58] vs 4.13 [0.58]; P < .001). LLM-generated narratives contained more unique errors (mean [SD] errors per summary, 2.91 [2.54]) than physician-generated narratives (mean [SD] errors per summary, 1.82 [1.94]). There was no significant difference in the potential for harm between LLM- and physician-generated narratives across individual errors (mean [SD] score, 1.35 [1.07] vs 1.34 [1.05]; P = .99), with 6 and 5 individual errors, respectively, scoring 4 (potential for permanent harm) or greater. Both LLM- and physician-generated narratives had low overall potential for harm (scores <1 on the 0 to 7 scale), with LLM-generated narratives scoring higher than physician narratives (mean [SD] score, 0.84 [0.98] vs 0.36 [0.70]; P < .001) and only 1 LLM-generated narrative (compared with 0 physician-generated narratives) scoring 4 or greater.
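As a minimal sketch, the code below shows how comparisons of the kind reported above could be run from per-encounter ratings. The abstract specifies a chi-square statistic for reviewer preference but does not name the test used for the mean-score comparisons; the paired t-test here is an assumption, and all arrays are stand-in data generated for illustration, not study data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-encounter mean quality ratings (1-5 Likert), one pair per
# encounter, drawn to roughly match the reported means and SDs.
llm_quality = rng.normal(3.67, 0.49, size=100)
physician_quality = rng.normal(3.77, 0.57, size=100)

# Paired comparison of mean overall quality (assumed test, not confirmed by the abstract).
t_stat, p_quality = stats.ttest_rel(llm_quality, physician_quality)
print(f"Overall quality: {llm_quality.mean():.2f} vs {physician_quality.mean():.2f}, P = {p_quality:.2f}")

# Reviewer preference: hypothetical counts across a 5-level preference scale
# (strongly prefer LLM ... strongly prefer physician). A chi-square
# goodness-of-fit test against no systematic preference has 4 degrees of
# freedom, matching the form of the reported chi-square = 5.2, P = .27.
preference_counts = [38, 45, 40, 42, 35]
chi2, p_pref = stats.chisquare(preference_counts)
print(f"Preference: chi-square = {chi2:.1f}, P = {p_pref:.2f}")
```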
CONCLUSIONS AND RELEVANCE: In this cross-sectional study of 100 inpatient hospital medicine encounters, LLM-generated discharge summary narratives were comparable in quality to, and preferred equally to, those generated by physicians. LLM-generated narratives were more likely to contain errors but had low overall harmfulness scores. These results suggest that, with human review, such narratives may offer a viable option for hospitalists in clinical practice.