Goh Ethan, Gallo Robert, Strong Eric, Weng Yingjie, Kerman Hannah, Freed Jason, Cool Joséphine A, Kanjee Zahir, Lane Kathleen P, Parsons Andrew S, Ahuja Neera, Horvitz Eric, Yang Daniel, Milstein Arnold, Olson Andrew P J, Hom Jason, Chen Jonathan H, Rodman Adam
Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA.
Stanford Clinical Excellence Research Center, Stanford University, Stanford, CA.
medRxiv. 2024 Aug 7:2024.08.05.24311485. doi: 10.1101/2024.08.05.24311485.
Importance: Large language model (LLM) artificial intelligence (AI) systems have shown promise in diagnostic reasoning, but their utility in management reasoning, where questions often have no single right answer, is unknown.
Objective: To determine whether LLM assistance improves physician performance on open-ended management reasoning tasks compared to conventional resources.
Design: Prospective, randomized controlled trial conducted from 30 November 2023 to 21 April 2024.
Setting: Multi-institutional study spanning Stanford University, Beth Israel Deaconess Medical Center, and the University of Virginia, involving physicians from across the United States.
Participants: 92 practicing attending physicians and residents with training in internal medicine, family medicine, or emergency medicine.
Interventions: Five expert-developed clinical case vignettes were presented, each with multiple open-ended management questions and scoring rubrics created through a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT Plus in addition to conventional resources (e.g., UpToDate, Google) or conventional resources alone.
Main Outcomes and Measures: The primary outcome was the difference in total score between groups on the expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case.
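As an illustration only, and not the study's actual instrument or code, the sketch below shows one way rubric-graded answers could be aggregated into a total score and the domain-specific percentages reported in the Results; the RubricItem structure, domain labels, and point values are all hypothetical.

from dataclasses import dataclass

@dataclass
class RubricItem:
    domain: str        # e.g., "management", "diagnosis", "case-specific" (labels assumed)
    points: float      # points awarded by a grader for one rubric criterion
    max_points: float  # maximum points available for that criterion

def percentage_scores(items: list[RubricItem]) -> dict[str, float]:
    # Total score as a percentage of all available points.
    scores = {"total": 100 * sum(i.points for i in items) / sum(i.max_points for i in items)}
    # Domain-specific percentages (e.g., management vs. diagnostic decisions).
    for domain in sorted({i.domain for i in items}):
        sub = [i for i in items if i.domain == domain]
        scores[domain] = 100 * sum(i.points for i in sub) / sum(i.max_points for i in sub)
    return scores

# Hypothetical grading of one case:
print(percentage_scores([
    RubricItem("management", 3, 4),
    RubricItem("diagnosis", 2, 2),
    RubricItem("case-specific", 1, 2),
]))
# {'total': 75.0, 'case-specific': 50.0, 'diagnosis': 100.0, 'management': 75.0}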
Results: Physicians using the LLM scored higher than those using conventional resources alone (mean difference 6.5%, 95% CI 2.7 to 10.2, p<0.001). Significant improvements were seen in the management-decision (6.1%, 95% CI 2.5 to 9.7, p=0.001), diagnostic-decision (12.1%, 95% CI 3.1 to 21.0, p=0.009), and case-specific (6.2%, 95% CI 2.4 to 9.9, p=0.002) domains. GPT-4 users spent more time per case (mean difference 119.3 seconds, 95% CI 17.4 to 221.2, p=0.02). There was no significant difference between GPT-4-augmented physicians and GPT-4 alone (-0.9%, 95% CI -9.0 to 7.2, p=0.8).
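For readers who want to see how this style of between-group comparison is computed, the sketch below derives an unadjusted mean difference with a Welch 95% CI and p-value on simulated scores. It is a minimal sketch, not the trial's analysis: the study presumably adjusted for repeated cases per physician (e.g., with a mixed-effects model), and every number here is invented.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated percentage scores for two arms of 46 physicians each (values invented).
llm = rng.normal(loc=50.0, scale=10.0, size=46)
ctrl = rng.normal(loc=43.5, scale=10.0, size=46)

diff = llm.mean() - ctrl.mean()
v1, v2 = llm.var(ddof=1), ctrl.var(ddof=1)
n1, n2 = len(llm), len(ctrl)
se = np.sqrt(v1 / n1 + v2 / n2)
# Welch-Satterthwaite degrees of freedom for unequal variances.
df = (v1 / n1 + v2 / n2) ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
t_stat = diff / se
p = 2 * stats.t.sf(abs(t_stat), df)
lo, hi = diff + np.array([-1, 1]) * stats.t.ppf(0.975, df) * se
print(f"mean difference {diff:.1f}%, 95% CI {lo:.1f} to {hi:.1f}, p={p:.3f}")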
Conclusions: LLM assistance improved physician management reasoning compared with conventional resources alone, with particular gains in contextual and patient-specific decision-making. These findings indicate that LLMs can augment management decision-making in complex cases.
Trial Registration: ClinicalTrials.gov Identifier: NCT06208423; https://classic.clinicaltrials.gov/ct2/show/NCT06208423.