Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study.

Affiliations

Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, United States.

Harvard Medical School, Boston, MA, United States.

Publication Information

J Med Internet Res. 2023 Aug 22;25:e48659. doi: 10.2196/48659.


DOI: 10.2196/48659
PMID: 37606976
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10481210/
Abstract

BACKGROUND: Large language model (LLM)-based artificial intelligence chatbots direct the power of large training data sets toward successive, related tasks, as opposed to single-ask tasks, for which artificial intelligence already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as artificial physicians, has not yet been evaluated.

OBJECTIVE: This study aimed to evaluate ChatGPT's capacity for ongoing clinical decision support via its performance on standardized clinical vignettes.

METHODS: We inputted all 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual into ChatGPT and compared its accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. Accuracy was measured by the proportion of correct responses to the questions posed within the clinical vignettes tested, as calculated by human scorers. We further conducted linear regression to assess the contributing factors toward ChatGPT's performance on clinical tasks.

RESULTS: ChatGPT achieved an overall accuracy of 71.7% (95% CI 69.3%-74.1%) across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis, with an accuracy of 76.9% (95% CI 67.8%-86.1%), and the lowest performance in generating an initial differential diagnosis, with an accuracy of 60.3% (95% CI 54.2%-66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=-15.8%; P<.001) and clinical management (β=-7.4%; P=.02) question types.

CONCLUSIONS: ChatGPT achieves impressive accuracy in clinical decision-making, with increasing strength as it gains more clinical information at its disposal. In particular, ChatGPT demonstrates the greatest accuracy in tasks of final diagnosis as compared to initial diagnosis. Limitations include possible model hallucinations and the unclear composition of ChatGPT's training data set.
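The abstract describes two analyses: a pooled accuracy reported with a 95% CI, and a linear regression relating per-question correctness to question type, with general medical knowledge as the reference category. The sketch below shows one way to reproduce that outline in Python; the data frame, its column names (`qtype`, `correct`), and the simulated scores are hypothetical stand-ins rather than the study's data, and the normal-approximation (Wald) interval is an assumption about how the CI was computed.

```python
# Minimal sketch of the abstract's analyses, assuming human-scored
# per-question correctness (0/1) labeled by question type. The toy
# data and column names are hypothetical, not the study's dataset.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    # Question types named in the paper: general medical knowledge,
    # differential diagnosis, diagnostic testing, final diagnosis,
    # and clinical management.
    "qtype": rng.choice(
        ["general", "differential", "testing", "final_dx", "management"], n),
    "correct": rng.integers(0, 2, n).astype(float),  # human-scored 0/1
})

# Overall accuracy with a Wald 95% CI, matching the "proportion of
# correct responses" definition in the METHODS section.
p = df["correct"].mean()
se = np.sqrt(p * (1 - p) / len(df))
print(f"accuracy {p:.1%} (95% CI {p - 1.96*se:.1%} to {p + 1.96*se:.1%})")

# Linear regression of correctness on question type. Each coefficient
# is the percentage-point gap relative to the reference level (general
# medical knowledge), analogous to the reported beta values.
model = smf.ols("correct ~ C(qtype, Treatment('general'))", data=df).fit()
print(model.summary().tables[1])
```

With real scores in place of the simulated ones, the coefficient table would carry the β estimates and P values quoted in the RESULTS paragraph.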

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d6d8/10481210/3a1829727c2b/jmir_v25i1e48659_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d6d8/10481210/ee74909d76f7/jmir_v25i1e48659_fig2.jpg

Similar Articles

[1]
Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study.

J Med Internet Res. 2023-8-22

[2]
Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow.

medRxiv. 2023-2-26

[3]
Assessing ChatGPT 4.0's test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports.

Sci Rep. 2024-4-23

[4]
Optimizing ChatGPT's Interpretation and Reporting of Delirium Assessment Outcomes: Exploratory Study.

JMIR Form Res. 2024-10-1

[5]
How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment.

JMIR Med Educ. 2023-2-8

[6]
Performance of ChatGPT on the Chinese Postgraduate Examination for Clinical Medicine: Survey Study.

JMIR Med Educ. 2024-2-9

[7]
ChatGPT-Generated Differential Diagnosis Lists for Complex Case-Derived Clinical Vignettes: Diagnostic Accuracy Evaluation.

JMIR Med Inform. 2023-10-9

[8]
Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study.

J Med Internet Res. 2024-6-14

[9]
Performance and exploration of ChatGPT in medical examination, records and education in Chinese: Pave the way for medical AI.

Int J Med Inform. 2023-9

[10]
Assessing question characteristic influences on ChatGPT's performance and response-explanation consistency: Insights from Taiwan's Nursing Licensing Exam.

Int J Nurs Stud. 2024-5

Cited By

[1]
ChatGPT's performance in sample size estimation: a preliminary study on the capabilities of artificial intelligence.

Fam Pract. 2025-8-14

[2]
Token Probabilities to Mitigate Large Language Models Overconfidence in Answering Medical Questions: Quantitative Study.

J Med Internet Res. 2025-8-29

[3]
Can ChatGPT Recognize Its Own Writing in Scientific Abstracts?

Cureus. 2025-7-25

[4]
Assessing DeepSeek-R1 for Clinical Decision Support in Multidisciplinary Laboratory Medicine.

J Multidiscip Healthc. 2025-8-12

[5]
Performance of Microsoft Copilot in the Diagnostic Process of Pulmonary Embolism.

West J Emerg Med. 2025-7-13

[6]
Postoperative complication management: How do large language models measure up to human expertise?

PLOS Digit Health. 2025-8-1

[7]
ChatGpt's accuracy in the diagnosis of oral lesions.

BMC Oral Health. 2025-7-21

[8]
Utilizing ChatGPT-3.5 to Assist Ophthalmologists in Clinical Decision-making.

J Ophthalmic Vis Res. 2025-5-5

[9]
Development and Evaluation of an Artificial Intelligence-Powered Surgical Oral Examination Simulator: A Pilot Study.

Mayo Clin Proc Digit Health. 2025-6-9

[10]
Framework for bias evaluation in large language models in healthcare settings.

NPJ Digit Med. 2025-7-7

References

[1]
The diagnostic and triage accuracy of the GPT-3 artificial intelligence model: an observational study.

Lancet Digit Health. 2024-8

[2]
Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot.

J Am Coll Radiol. 2023-10

[3]
Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models.

PLOS Digit Health. 2023-2-9

[4]
A Collaborative Artificial Intelligence Annotation Platform Leveraging Blockchain For Medical Imaging Research.

Blockchain Healthc Today. 2021-6-22

[5]
Nonhuman "Authors" and Implications for the Integrity of Scientific Publication and Medical Knowledge.

JAMA. 2023-2-28

[6]
Prediction of oxygen requirement in patients with COVID-19 using a pre-trained chest radiograph xAI model: efficient development of auditable risk prediction models via a fine-tuning approach.

Sci Rep. 2022-12-7

[7]
Intubation and mortality prediction in hospitalized COVID-19 patients using a combination of convolutional neural network-based scoring of chest radiographs and clinical data.

BJR Open. 2022-3-24

[8]
Multi-population generalizability of a deep learning-based chest radiograph severity score for COVID-19.

Medicine (Baltimore). 2022-7-22

[9]
Accurate auto-labeling of chest X-ray images based on quantitative similarity to an explainable AI model.

Nat Commun. 2022-4-6

[10]
Chatbot for Health Care and Oncology Applications Using Artificial Intelligence and Machine Learning: Systematic Review.

JMIR Cancer. 2021-11-29
