
Large Language Model Influence on Management Reasoning: A Randomized Controlled Trial.

Authors

Goh Ethan, Gallo Robert, Strong Eric, Weng Yingjie, Kerman Hannah, Freed Jason, Cool Joséphine A, Kanjee Zahir, Lane Kathleen P, Parsons Andrew S, Ahuja Neera, Horvitz Eric, Yang Daniel, Milstein Arnold, Olson Andrew P J, Hom Jason, Chen Jonathan H, Rodman Adam

Affiliations

Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA.

Stanford Clinical Excellence Research Center, Stanford University, Stanford, CA.

Publication information

medRxiv. 2024 Aug 7:2024.08.05.24311485. doi: 10.1101/2024.08.05.24311485.

DOI:10.1101/2024.08.05.24311485
PMID:39148822
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11326321/
Abstract

IMPORTANCE

Large language model (LLM) artificial intelligence (AI) systems have shown promise in diagnostic reasoning, but their utility in management reasoning with no clear right answers is unknown.

OBJECTIVE

To determine whether LLM assistance improves physician performance on open-ended management reasoning tasks compared to conventional resources.

DESIGN

Prospective, randomized controlled trial conducted from 30 November 2023 to 21 April 2024.

SETTING

Multi-institutional study from Stanford University, Beth Israel Deaconess Medical Center, and the University of Virginia involving physicians from across the United States.

PARTICIPANTS

92 practicing attending physicians and residents with training in internal medicine, family medicine, or emergency medicine.

INTERVENTION

Five expert-developed clinical case vignettes were presented with multiple open-ended management questions and scoring rubrics created through a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT Plus in addition to conventional resources (e.g., UpToDate, Google), or conventional resources alone.

MAIN OUTCOMES AND MEASURES

The primary outcome was difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case.

RESULTS

Physicians using the LLM scored higher compared to those using conventional resources (mean difference 6.5%, 95% CI 2.7-10.2, p<0.001). Significant improvements were seen in management decisions (6.1%, 95% CI 2.5-9.7, p=0.001), diagnostic decisions (12.1%, 95% CI 3.1-21.0, p=0.009), and case-specific (6.2%, 95% CI 2.4-9.9, p=0.002) domains. GPT-4 users spent more time per case (mean difference 119.3 seconds, 95% CI 17.4-221.2, p=0.02). There was no significant difference between GPT-4-augmented physicians and GPT-4 alone (-0.9%, 95% CI -9.0 to 7.2, p=0.8).
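
As a rough consistency check on the figures above, the standard error and a two-sided p-value can be recovered from each reported estimate and its 95% CI. The sketch below assumes a symmetric normal-approximation (Wald) interval. That is an assumption made here for illustration, because the abstract does not describe the statistical model the trial used. The function name `wald_summary` and the dictionary labels are hypothetical.

```python
import math

def wald_summary(estimate, ci_low, ci_high, z_crit=1.959964):
    """Back out SE, z, and a two-sided p-value from a 95% CI.

    Assumes the interval is estimate +/- z_crit * SE (normal approximation).
    This is an illustrative assumption, not the trial's stated method.
    """
    se = (ci_high - ci_low) / (2 * z_crit)   # half-width divided by z_crit
    z = estimate / se                        # Wald statistic against a null of 0
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided tail of the standard normal
    return se, z, p

# Estimates and 95% CIs exactly as reported in the abstract.
reported = {
    "total score (%)":        (6.5, 2.7, 10.2),
    "management (%)":         (6.1, 2.5, 9.7),
    "diagnostic (%)":         (12.1, 3.1, 21.0),
    "case-specific (%)":      (6.2, 2.4, 9.9),
    "time per case (s)":      (119.3, 17.4, 221.2),
    "physician+GPT-4 vs GPT-4 alone (%)": (-0.9, -9.0, 7.2),
}

for label, (est, lo, hi) in reported.items():
    se, z, p = wald_summary(est, lo, hi)
    print(f"{label}: SE={se:.2f}, z={z:.2f}, p={p:.4f}")
```

Under this approximation the recomputed p-values should land near the reported ones. Small gaps are expected, since the reported values are rounded and the trial's actual analysis may differ from a simple Wald test.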

CONCLUSIONS AND RELEVANCE

LLM assistance improved physician management reasoning compared to conventional resources, with particular gains in contextual and patient-specific decision-making. These findings indicate that LLMs can augment management decision-making in complex cases.

TRIAL REGISTRATION

ClinicalTrials.gov Identifier: NCT06208423; https://classic.clinicaltrials.gov/ct2/show/NCT06208423.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f32a/11326321/928800a2176e/nihpp-2024.08.05.24311485v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f32a/11326321/e5b64ea7bbfa/nihpp-2024.08.05.24311485v1-f0001.jpg

