Temerty Faculty of Medicine, University of Toronto, Health Centre at 80 Bond, St. Michael's Hospital, 80 Bond Street, Toronto, ON, M5B1X2, Canada.
Department of Medicine, Royal College of Surgeons in Ireland, Dublin, Leinster, Ireland.
BMC Med Educ. 2024 Oct 11;24(1):1133. doi: 10.1186/s12909-024-06115-5.
Artificial intelligence (AI) chatbots have demonstrated proficiency in structured knowledge assessments; however, there is limited research on their performance in scenarios involving diagnostic uncertainty, which requires careful interpretation and complex decision-making. This study evaluates the performance of two AI chatbots, GPT-4o and Claude-3, on medical scenarios characterized by diagnostic uncertainty, relative to that of Family Medicine residents.
Questions with diagnostic uncertainty were extracted from the Progress Tests administered by the Department of Family and Community Medicine at the University of Toronto between 2022 and 2023. Diagnostic uncertainty questions were defined as those presenting clinical scenarios in which symptoms, clinical findings, and patient histories do not converge on a definitive diagnosis, necessitating nuanced diagnostic reasoning and differential diagnosis. These questions were administered to a cohort of 320 Family Medicine residents in their first (PGY-1) and second (PGY-2) postgraduate years and entered into GPT-4o and Claude-3. Errors were categorized into statistical, information, and logical errors. Statistical analyses were conducted using a binomial generalized estimating equation model, paired t-tests, and chi-squared tests.
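As a rough illustration of the analysis described above, a binomial generalized estimating equation and a chi-squared comparison of the two chatbots could be run as follows. This is a minimal sketch, not the authors' code: the input file and column names (resident_id, cohort, correct) are hypothetical, and the exchangeable working correlation structure is an assumption.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency

# Long-format data: one row per resident-question response.
# Hypothetical columns: resident_id, cohort ("PGY-1"/"PGY-2"), correct (0/1).
df = pd.read_csv("progress_test_responses.csv")

# Binomial GEE: models the probability of a correct answer by cohort,
# with an exchangeable working correlation to account for repeated
# questions answered by the same resident.
gee = smf.gee(
    "correct ~ C(cohort)",
    groups="resident_id",
    data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
).fit()
print(gee.summary())

# Chi-squared test comparing overall chatbot correctness, using the
# counts reported in the results (52/90 vs. 48/90 correct).
table = [[52, 90 - 52],   # Claude-3: correct, incorrect
         [48, 90 - 48]]   # GPT-4o:  correct, incorrect
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}")
```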
Compared to the residents, both chatbots scored lower on diagnostic uncertainty questions (p < 0.01). PGY-1 residents achieved a correctness rate of 61.1% (95% CI: 58.4-63.7), and PGY-2 residents achieved 63.3% (95% CI: 60.7-66.1). In contrast, Claude-3 correctly answered 57.7% (n = 52/90) of questions, and GPT-4o correctly answered 53.3% (n = 48/90). Claude-3 had a longer mean response time (24.0 s, 95% CI: 21.0-32.5 vs. 12.4 s, 95% CI: 9.3-15.3; p < 0.01) and produced longer answers (2001 characters, 95% CI: 1845-2212 vs. 1596 characters, 95% CI: 1395-1705; p < 0.01) compared to GPT-4o. Most errors by GPT-4o were logical errors (62.5%).
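For context, approximate 95% confidence intervals for the chatbots' correctness rates can be recovered from the raw counts quoted above. The sketch below uses the Wilson score interval; the paper does not state which interval method underlies its reported figures, so this is only illustrative.

```python
from statsmodels.stats.proportion import proportion_confint

# Wilson score 95% CIs for each chatbot's correctness proportion
# (number correct out of 90 diagnostic-uncertainty questions).
for name, correct in [("Claude-3", 52), ("GPT-4o", 48)]:
    lo, hi = proportion_confint(correct, 90, alpha=0.05, method="wilson")
    print(f"{name}: {correct/90:.1%} (95% CI: {lo:.1%}-{hi:.1%})")
```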
While AI chatbots like GPT-4o and Claude-3 demonstrate potential in handling structured medical knowledge, their performance in scenarios involving diagnostic uncertainty remains suboptimal compared to human residents.