Goh Ethan, Gallo Robert, Strong Eric, Weng Yingjie, Kerman Hannah, Freed Jason, Cool Joséphine A, Kanjee Zahir, Lane Kathleen P, Parsons Andrew S, Ahuja Neera, Horvitz Eric, Yang Daniel, Milstein Arnold, Olson Andrew P J, Hom Jason, Chen Jonathan H, Rodman Adam
Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA.
Stanford Clinical Excellence Research Center, Stanford University, Stanford, CA.
medRxiv. 2024 Aug 7:2024.08.05.24311485. doi: 10.1101/2024.08.05.24311485.
Importance: Large language model (LLM) artificial intelligence (AI) systems have shown promise in diagnostic reasoning, but their utility in management reasoning, where questions often have no single right answer, is unknown.
Objective: To determine whether LLM assistance improves physician performance on open-ended management reasoning tasks compared to conventional resources.
Design: Prospective, randomized controlled trial conducted from 30 November 2023 to 21 April 2024.
Setting: Multi-institutional study spanning Stanford University, Beth Israel Deaconess Medical Center, and the University of Virginia, involving physicians from across the United States.
Participants: 92 practicing attending physicians and residents with training in internal medicine, family medicine, or emergency medicine.
Interventions: Five expert-developed clinical case vignettes were presented, each with multiple open-ended management questions and scoring rubrics created through a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT Plus in addition to conventional resources (e.g., UpToDate, Google) or conventional resources alone.
Main Outcomes and Measures: The primary outcome was the difference in total score between groups on the expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case.
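As an illustration only, and not the study's actual instrument or code, the sketch below shows one way rubric-graded answers could be aggregated into a total score and the domain-specific percentages reported in the Results; the RubricItem structure, domain labels, and point values are all hypothetical.

from dataclasses import dataclass

@dataclass
class RubricItem:
    domain: str        # e.g., "management", "diagnosis", "case-specific" (labels assumed)
    points: float      # points awarded by a grader for one rubric criterion
    max_points: float  # maximum points available for that criterion

def percentage_scores(items: list[RubricItem]) -> dict[str, float]:
    # Total score as a percentage of all available points.
    scores = {"total": 100 * sum(i.points for i in items) / sum(i.max_points for i in items)}
    # Domain-specific percentages (e.g., management vs. diagnostic decisions).
    for domain in sorted({i.domain for i in items}):
        sub = [i for i in items if i.domain == domain]
        scores[domain] = 100 * sum(i.points for i in sub) / sum(i.max_points for i in sub)
    return scores

# Hypothetical grading of one case:
print(percentage_scores([
    RubricItem("management", 3, 4),
    RubricItem("diagnosis", 2, 2),
    RubricItem("case-specific", 1, 2),
]))
# {'total': 75.0, 'case-specific': 50.0, 'diagnosis': 100.0, 'management': 75.0}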
Results: Physicians using the LLM scored higher than those using conventional resources alone (mean difference 6.5%, 95% CI 2.7 to 10.2, p<0.001). Significant improvements were seen in the management-decision (6.1%, 95% CI 2.5 to 9.7, p=0.001), diagnostic-decision (12.1%, 95% CI 3.1 to 21.0, p=0.009), and case-specific (6.2%, 95% CI 2.4 to 9.9, p=0.002) domains. GPT-4 users spent more time per case (mean difference 119.3 seconds, 95% CI 17.4 to 221.2, p=0.02). There was no significant difference between GPT-4-augmented physicians and GPT-4 alone (-0.9%, 95% CI -9.0 to 7.2, p=0.8).
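For readers who want to see how this style of between-group comparison is computed, the sketch below derives an unadjusted mean difference with a Welch 95% CI and p-value on simulated scores. It is a minimal sketch, not the trial's analysis: the study presumably adjusted for repeated cases per physician (e.g., with a mixed-effects model), and every number here is invented.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated percentage scores for two arms of 46 physicians each (values invented).
llm = rng.normal(loc=50.0, scale=10.0, size=46)
ctrl = rng.normal(loc=43.5, scale=10.0, size=46)

diff = llm.mean() - ctrl.mean()
v1, v2 = llm.var(ddof=1), ctrl.var(ddof=1)
n1, n2 = len(llm), len(ctrl)
se = np.sqrt(v1 / n1 + v2 / n2)
# Welch-Satterthwaite degrees of freedom for unequal variances.
df = (v1 / n1 + v2 / n2) ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
t_stat = diff / se
p = 2 * stats.t.sf(abs(t_stat), df)
lo, hi = diff + np.array([-1, 1]) * stats.t.ppf(0.975, df) * se
print(f"mean difference {diff:.1f}%, 95% CI {lo:.1f} to {hi:.1f}, p={p:.3f}")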
Conclusions: LLM assistance improved physician management reasoning compared with conventional resources alone, with particular gains in contextual and patient-specific decision-making. These findings indicate that LLMs can augment management decision-making in complex cases.
Trial Registration: ClinicalTrials.gov Identifier: NCT06208423; https://classic.clinicaltrials.gov/ct2/show/NCT06208423.