
The future of AI clinicians: assessing the modern standard of chatbots and their approach to diagnostic uncertainty.

Affiliations

Temerty Faculty of Medicine, University of Toronto, Health Centre at 80 Bond, St. Michael's Hospital, 80 Bond Street, Toronto, ON, M5B1X2, Canada.

Department of Medicine, Royal College of Surgeons in Ireland, Dublin, Leinster, Ireland.

Publication information

BMC Med Educ. 2024 Oct 11;24(1):1133. doi: 10.1186/s12909-024-06115-5.

DOI: 10.1186/s12909-024-06115-5
PMID: 39394122
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11470580/
Abstract

BACKGROUND

Artificial intelligence (AI) chatbots have demonstrated proficiency in structured knowledge assessments; however, there is limited research on their performance in scenarios involving diagnostic uncertainty, which require careful interpretation and complex decision-making. This study evaluates the efficacy of the AI chatbots GPT-4o and Claude-3 in addressing medical scenarios characterized by diagnostic uncertainty, relative to Family Medicine residents.

METHODS

Questions with diagnostic uncertainty were extracted from the Progress Tests administered by the Department of Family and Community Medicine at the University of Toronto between 2022 and 2023. Diagnostic uncertainty questions were defined as those presenting clinical scenarios where symptoms, clinical findings, and patient histories do not converge on a definitive diagnosis, necessitating nuanced diagnostic reasoning and differential diagnosis. These questions were administered to a cohort of 320 Family Medicine residents in their first (PGY-1) and second (PGY-2) postgraduate years and inputted into GPT-4o and Claude-3. Errors were categorized into statistical, information, and logical errors. Statistical analyses were conducted using a binomial generalized estimating equation model, paired t-tests, and chi-squared tests.

RESULTS

Compared to the residents, both chatbots scored lower on diagnostic uncertainty questions (p < 0.01). PGY-1 residents achieved a correctness rate of 61.1% (95% CI: 58.4-63.7), and PGY-2 residents achieved 63.3% (95% CI: 60.7-66.1). In contrast, Claude-3 correctly answered 57.7% (n = 52/90) of questions, and GPT-4o correctly answered 53.3% (n = 48/90). Claude-3 had a longer mean response time (24.0 s, 95% CI: 21.0-32.5 vs. 12.4 s, 95% CI: 9.3-15.3; p < 0.01) and produced longer answers (2001 characters, 95% CI: 1845-2212 vs. 1596 characters, 95% CI: 1395-1705; p < 0.01) compared to GPT-4o. Most errors by GPT-4o were logical errors (62.5%).
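As a quick check of the head-to-head chatbot numbers reported above (Claude-3: 52/90 correct; GPT-4o: 48/90), a Pearson chi-squared test on the 2x2 table of correct/incorrect answers can be computed with the standard library alone. This is a hypothetical re-computation for illustration, not the paper's analysis code.

```python
# 2x2 contingency table: rows = model, columns = (correct, incorrect)
claude = (52, 90 - 52)   # Claude-3: 52/90 correct
gpt4o  = (48, 90 - 48)   # GPT-4o: 48/90 correct

def chi2_2x2(row_a, row_b):
    """Pearson chi-squared statistic (1 df, no continuity correction)."""
    observed = [row_a, row_b]
    row_totals = [sum(r) for r in observed]
    col_totals = [row_a[0] + row_b[0], row_a[1] + row_b[1]]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

stat = chi2_2x2(claude, gpt4o)
# The critical value for 1 df at alpha = 0.05 is 3.841, so the
# Claude-3 vs. GPT-4o difference is not significant on its own.
print(f"chi2 = {stat:.3f}")  # chi2 = 0.360
```

The significant result reported in the abstract (p < 0.01) is for chatbots versus residents, not for the two chatbots against each other.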

CONCLUSIONS

While AI chatbots like GPT-4o and Claude-3 demonstrate potential in handling structured medical knowledge, their performance in scenarios involving diagnostic uncertainty remains suboptimal compared to human residents.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb1e/11470580/6fc1d2bc4b91/12909_2024_6115_Fig1_HTML.jpg
