
Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study.

Affiliations

Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, United States.

Harvard Medical School, Boston, MA, United States.

Publication Information

J Med Internet Res. 2023 Aug 22;25:e48659. doi: 10.2196/48659.

Abstract

BACKGROUND

Large language model (LLM)-based artificial intelligence chatbots direct the power of large training data sets toward successive, related tasks as opposed to single-ask tasks, for which artificial intelligence already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as artificial physicians, has not yet been evaluated.

OBJECTIVE

This study aimed to evaluate ChatGPT's capacity for ongoing clinical decision support via its performance on standardized clinical vignettes.

METHODS

We inputted all 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual into ChatGPT and compared its accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. Accuracy was measured by the proportion of correct responses to the questions posed within the clinical vignettes tested, as calculated by human scorers. We further conducted linear regression to assess the contributing factors toward ChatGPT's performance on clinical tasks.
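The accuracy metric described here is a simple proportion of correct responses, reported with 95% confidence intervals. A minimal sketch of how such an interval can be computed, assuming a normal-approximation (Wald) interval and hypothetical counts (the study does not state its exact CI method or raw counts):

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Proportion correct with a normal-approximation (Wald) 95% CI."""
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)  # standard error of a proportion
    return p, p - z * se, p + z * se

# Hypothetical counts for illustration only (not the study's raw data).
p, lo, hi = accuracy_ci(717, 1000)
print(f"accuracy {p:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```

With many questions pooled across vignettes, this interval narrows; the study's tight overall CI (69.3%-74.1% around 71.7%) is consistent with a pooled-proportion calculation over a large question set.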

RESULTS

ChatGPT achieved an overall accuracy of 71.7% (95% CI 69.3%-74.1%) across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis with an accuracy of 76.9% (95% CI 67.8%-86.1%) and the lowest performance in generating an initial differential diagnosis with an accuracy of 60.3% (95% CI 54.2%-66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=-15.8%; P<.001) and clinical management (β=-7.4%; P=.02) question types.
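The reported β values compare each question type against the general-medical-knowledge reference category. In a dummy-coded linear regression with a single categorical predictor, each β is exactly the difference in mean accuracy between that category and the reference, which a short sketch makes concrete (the outcome lists below are hypothetical, not study data):

```python
# Dummy-coded OLS with one categorical predictor: the slope on each
# category dummy equals that category's mean-accuracy difference from
# the reference category (here, general medical knowledge).
# Hypothetical per-question outcomes (1 = correct), not study data.
general = [1, 1, 1, 0, 1, 1, 0, 1]       # reference category
differential = [1, 0, 1, 0, 0, 1, 0, 1]  # differential-diagnosis items

def mean(xs):
    return sum(xs) / len(xs)

beta_diff = mean(differential) - mean(general)  # slope on the dummy
print(f"beta for differential diagnosis: {beta_diff:+.1%}")
```

A negative β, as with the study's β=-15.8% for differential diagnosis, therefore means that category's accuracy ran that many percentage points below accuracy on general knowledge questions.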

CONCLUSIONS

ChatGPT achieves impressive accuracy in clinical decision-making, with increasing strength as it gains more clinical information at its disposal. In particular, ChatGPT demonstrates the greatest accuracy in tasks of final diagnosis as compared to initial diagnosis. Limitations include possible model hallucinations and the unclear composition of ChatGPT's training data set.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d6d8/10481210/3a1829727c2b/jmir_v25i1e48659_fig1.jpg
