Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, United States.
Harvard Medical School, Boston, MA, United States.
J Med Internet Res. 2023 Aug 22;25:e48659. doi: 10.2196/48659.
BACKGROUND: Large language model (LLM)-based artificial intelligence chatbots direct the power of large training data sets toward successive, related tasks, as opposed to single-ask tasks, for which artificial intelligence already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as artificial physicians, has not yet been evaluated.
OBJECTIVE: This study aimed to evaluate ChatGPT's capacity for ongoing clinical decision support via its performance on standardized clinical vignettes.
METHODS: We inputted all 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual into ChatGPT and compared its accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. Accuracy was measured as the proportion of correct responses to the questions posed within the clinical vignettes tested, as calculated by human scorers. We further conducted linear regression to assess the factors contributing to ChatGPT's performance on clinical tasks.
RESULTS: ChatGPT achieved an overall accuracy of 71.7% (95% CI 69.3%-74.1%) across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis, with an accuracy of 76.9% (95% CI 67.8%-86.1%), and the lowest performance in generating an initial differential diagnosis, with an accuracy of 60.3% (95% CI 54.2%-66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=-15.8%; P<.001) and clinical management (β=-7.4%; P=.02) question types.
CONCLUSIONS: ChatGPT achieves impressive accuracy in clinical decision-making, with increasing strength as more clinical information becomes available to it. In particular, ChatGPT demonstrates the greatest accuracy in tasks of final diagnosis as compared with initial differential diagnosis. Limitations include possible model hallucinations and the unclear composition of ChatGPT's training data set.
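As a rough illustration of the accuracy and regression analysis described in METHODS, the Python sketch below scores a table of human-graded responses, computes a normal-approximation 95% CI for overall accuracy, and regresses correctness on question type with general medical knowledge as the reference level. The column names, example data, and choice of ordinary least squares with a normal-approximation interval are assumptions for illustration, not the authors' actual analysis code.

```python
# Illustrative sketch (not the authors' code): in the study, each row would be
# one question from one of the 36 MSD vignettes, graded 1 (correct) or 0
# (incorrect) by human scorers.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "question_type": ["differential", "testing", "diagnosis", "management",
                      "knowledge", "differential", "diagnosis", "knowledge"],
    "correct":       [0, 1, 1, 1, 1, 1, 1, 0],
})

# Overall accuracy with a normal-approximation 95% CI (one plausible choice).
p = df["correct"].mean()
n = len(df)
half_width = 1.96 * np.sqrt(p * (1 - p) / n)
print(f"accuracy: {p:.1%} (95% CI {p - half_width:.1%} to {p + half_width:.1%})")

# Linear regression of correctness on question type, with general medical
# knowledge questions as the reference category, mirroring the reported betas.
model = smf.ols(
    "correct ~ C(question_type, Treatment(reference='knowledge'))", data=df
).fit()
print(model.params)
```

Under this setup, each β coefficient is read as the difference in accuracy, in percentage points, between that question type and general medical knowledge questions, which is how the reported β=-15.8% for differential diagnosis and β=-7.4% for clinical management would be interpreted.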