Physician- and Large Language Model-Generated Hospital Discharge Summaries.

Author information

Williams Christopher Y K, Subramanian Charumathi Raghu, Ali Syed Salman, Apolinario Michael, Askin Elisabeth, Barish Peter, Cheng Monica, Deardorff W James, Donthi Nisha, Ganeshan Smitha, Huang Owen, Kantor Molly A, Lai Andrew R, Manchanda Ashley, Moore Kendra A, Muniyappa Anoop N, Nair Geethu, Patel Prashant P, Santhosh Lekshmi, Schneider Susan, Torres Shawn, Yukawa Michi, Hubbard Colin C, Rosner Benjamin I

Affiliations

Bakar Computational Health Sciences Institute, University of California San Francisco.

Division of Hospital Medicine, University of California San Francisco.

Publication information

JAMA Intern Med. 2025 May 5. doi: 10.1001/jamainternmed.2025.0821.

Abstract

IMPORTANCE

High-quality discharge summaries are associated with improved patient outcomes, but contribute to clinical documentation burden. Large language models (LLMs) provide an opportunity to support physicians by drafting discharge summary narratives.

OBJECTIVE

To determine whether LLM-generated discharge summary narratives are of comparable quality and safety to those of physicians.

DESIGN, SETTING, AND PARTICIPANTS

This cross-sectional study conducted at the University of California, San Francisco included 100 randomly selected inpatient hospital medicine encounters of 3 to 6 days' duration between 2019 and 2022. The analysis took place in July 2024.

EXPOSURE

A blinded evaluation of physician- and LLM-generated narratives was performed in duplicate by 22 attending physician reviewers.

MAIN OUTCOMES AND MEASURES

Narratives were reviewed for overall quality, reviewer preference, comprehensiveness, concision, coherence, and 3 error types (inaccuracies, omissions, and hallucinations). Each error individually, and each narrative overall, were assigned potential harmfulness scores ranging from 0 to 7 on an adapted Agency for Healthcare Research and Quality scale.

RESULTS

Across 100 encounters, LLM- and physician-generated narratives were comparable in overall quality on a Likert scale ranging from 1 to 5 (higher scores indicate higher quality; mean [SD] score, 3.67 [0.49] vs 3.77 [0.57]; P = .21) and reviewer preference (χ2 = 5.2; P = .27). LLM-generated narratives were more concise (mean [SD] score, 4.01 [0.37] vs 3.70 [0.59]; P < .001) and more coherent (mean [SD] score, 4.16 [0.39] vs 4.01 [0.53]; P = .02) than their physician-generated counterparts, but less comprehensive (mean [SD] score, 3.72 [0.58] vs 4.13 [0.58]; P < .001). LLM-generated narratives contained more unique errors (mean [SD] errors per summary, 2.91 [2.54]) than physician-generated narratives (mean [SD] errors per summary, 1.82 [1.94]). There was no significant difference in the potential for harm between LLM- and physician-generated narratives across individual errors (mean [SD] of 1.35 [1.07] vs 1.34 [1.05]; P = .99), with 6 and 5 individual errors, respectively, with scores of 4 (potential for permanent harm) or greater. Both LLM- and physician-generated narratives had low overall potential for harm (scores <1 on a scale ranging from 0-7), with LLM-generated narratives scoring higher than physician narratives (mean [SD] score of 0.84 [0.98] vs 0.36 [0.70]; P < .001) and only 1 LLM-generated narrative (compared with 0 physician-generated narratives) scoring 4 or greater.
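
As a reading aid, the group differences reported above can be tabulated with a few lines of arithmetic. This is a minimal sketch, not the study's analysis. The means and SDs are copied from the Results paragraph; the function names are hypothetical. The standardized difference uses a pooled SD that treats the two groups as independent and equal in size, which is an assumption of this sketch. The study compared narratives for the same encounters, so its own tests may differ.

```python
from math import sqrt

# (mean, SD) pairs copied from the Results paragraph: (LLM, physician).
REPORTED = {
    "overall quality": ((3.67, 0.49), (3.77, 0.57)),
    "concision": ((4.01, 0.37), (3.70, 0.59)),
    "coherence": ((4.16, 0.39), (4.01, 0.53)),
    "comprehensiveness": ((3.72, 0.58), (4.13, 0.58)),
    "errors per summary": ((2.91, 2.54), (1.82, 1.94)),
    "overall harm score": ((0.84, 0.98), (0.36, 0.70)),
}


def mean_difference(llm, physician):
    """Raw difference in means (LLM minus physician), rounded to 2 places."""
    return round(llm[0] - physician[0], 2)


def standardized_difference(llm, physician):
    """Mean difference divided by a pooled SD.

    Assumption (not from the source): equal group sizes and independent
    groups, so the pooled SD is the root mean of the two variances.
    """
    pooled_sd = sqrt((llm[1] ** 2 + physician[1] ** 2) / 2)
    return (llm[0] - physician[0]) / pooled_sd


def summarize():
    """Return {measure: (raw difference, standardized difference)}."""
    return {
        name: (mean_difference(l, p), round(standardized_difference(l, p), 2))
        for name, (l, p) in REPORTED.items()
    }


if __name__ == "__main__":
    for name, (diff, std_diff) in summarize().items():
        print(f"{name}: difference {diff:+.2f}, standardized {std_diff:+.2f}")
```

Positive values mean the LLM-generated narratives scored higher on that measure. For quality, concision, coherence, and comprehensiveness a higher score is better, while for errors per summary and overall harm score a higher value is worse.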

CONCLUSIONS AND RELEVANCE

In this cross-sectional study of 100 inpatient hospital medicine encounters, LLM-generated discharge summary narratives were of comparable quality, and were preferred equally, to those generated by physicians. LLM-generated narratives were more likely to contain errors but had low overall harmfulness scores. These results suggest that, in clinical practice, using such narratives after human review may provide a viable option for hospitalists.
