Goh Ethan, Gallo Robert, Hom Jason, Strong Eric, Weng Yingjie, Kerman Hannah, Cool Josephine, Kanjee Zahir, Parsons Andrew S, Ahuja Neera, Horvitz Eric, Yang Daniel, Milstein Arnold, Olson Andrew P J, Rodman Adam, Chen Jonathan H
Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA.
Stanford Clinical Excellence Research Center, Stanford University, Stanford, CA.
medRxiv. 2024 Mar 14:2024.03.12.24303785. doi: 10.1101/2024.03.12.24303785.
Diagnostic errors are common and cause significant morbidity. Large language models (LLMs) have shown promising performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves physicians' diagnostic reasoning.
To assess the impact of the GPT-4 LLM on physicians' diagnostic reasoning compared to conventional resources.
Multi-center, randomized clinical vignette study.
The study was conducted via remote video conferencing with physicians across the country and through in-person participation at multiple academic medical institutions.
Resident and attending physicians with training in family medicine, internal medicine, or emergency medicine.
Participants were randomized to access GPT-4 in addition to conventional diagnostic resources or to conventional resources alone. They were allocated 60 minutes to review up to six clinical vignettes adapted from established diagnostic reasoning exams.
The primary outcome was diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps. Secondary outcomes included time spent per case and final diagnosis accuracy.
Fifty physicians (26 attendings, 24 residents) participated, with an average of 5.2 cases completed per participant. The median diagnostic reasoning score per case was 76.3 percent (IQR 65.8 to 86.8) for the GPT-4 group and 73.7 percent (IQR 63.2 to 84.2) for the conventional resources group, with an adjusted difference of 1.6 percentage points (95% CI -4.4 to 7.6; p=0.60). The median time spent on cases for the GPT-4 group was 519 seconds (IQR 371 to 668 seconds), compared to 565 seconds (IQR 456 to 788 seconds) for the conventional resources group, with a time difference of -82 seconds (95% CI -195 to 31; p=0.20). GPT-4 alone scored 15.5 percentage points (95% CI 1.5 to 29, p=0.03) higher than the conventional resources group.
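The adjusted between-group difference reported above implies an analysis that accounts for repeated cases per physician. As a minimal, hypothetical sketch (the abstract does not specify the statistical model, and the data frame, column names, and values below are invented for illustration), per-case scores could be compared between study arms with a linear mixed-effects model that includes a random intercept per participant:

```python
# Illustrative sketch only: the abstract does not state the exact analysis.
# Assumes long-format data with one row per (physician, case) and the arm assignment.
import pandas as pd
import statsmodels.formula.api as smf

# Toy data for illustration; real scores range 0-100 per case.
df = pd.DataFrame({
    "participant_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "arm": ["gpt4", "gpt4", "conventional", "conventional",
            "gpt4", "gpt4", "conventional", "conventional"],
    "score": [76.3, 86.8, 73.7, 63.2, 65.8, 84.2, 84.2, 73.7],
})

# Random intercept per physician; the coefficient on the GPT-4 arm estimates the
# adjusted between-group difference in diagnostic reasoning score (percentage points).
model = smf.mixedlm("score ~ C(arm, Treatment('conventional'))", df,
                    groups=df["participant_id"])
result = model.fit()
print(result.summary())    # fixed-effect estimate for the GPT-4 arm
print(result.conf_int())   # 95% confidence intervals
```

With such a model, the arm coefficient and its confidence interval play the role of the adjusted difference and 95% CI quoted in the results; the actual study may have used a different adjustment strategy.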
In a clinical vignette-based study, the availability of GPT-4 to physicians as a diagnostic aid did not significantly improve clinical reasoning compared to conventional resources, although it may improve components of clinical reasoning such as efficiency. GPT-4 alone demonstrated higher performance than both physician groups, suggesting opportunities for further improvement in physician-AI collaboration in clinical practice.