Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study

Authors

Huang Ryan S, Lu Kevin Jia Qi, Meaney Christopher, Kemppainen Joel, Punnett Angela, Leung Fok-Han

Affiliations

Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada.

Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada.

Publication

JMIR Med Educ. 2023 Sep 19;9:e50514. doi: 10.2196/50514.

DOI: 10.2196/50514
PMID: 37725411
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10548315/
Abstract

BACKGROUND

Large language model (LLM)-based chatbots are evolving at an unprecedented pace with the release of ChatGPT, specifically GPT-3.5, and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have advanced to the point of performing excellently on various educational examination benchmarks, including medical knowledge tests. Comparing the performance of these 2 LLMs to that of Family Medicine residents on a multiple-choice medical knowledge test can provide insights into their potential as medical education tools.

OBJECTIVE

This study aimed to quantitatively and qualitatively compare the performance of GPT-3.5, GPT-4, and Family Medicine residents in a multiple-choice medical knowledge test appropriate for the level of a Family Medicine resident.

METHODS

An official University of Toronto Department of Family and Community Medicine Progress Test consisting of multiple-choice questions was inputted into GPT-3.5 and GPT-4. The artificial intelligence chatbots' responses were manually reviewed to determine the selected answer, response length, response time, provision of a rationale for the outputted response, and the root cause of all incorrect responses (classified into arithmetic, logical, and information errors). The performance of the artificial intelligence chatbots was compared against a cohort of Family Medicine residents who concurrently attempted the test.
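
As a rough illustration of the bookkeeping this protocol implies, the sketch below sends one multiple-choice question to a chat model via the OpenAI Python client and records the fields the authors tracked (selected answer, response length, response time). This is an assumption for illustration only: the study entered questions through the ChatGPT interface and reviewed responses by hand, and the model name, prompt format, and `ask_question` helper are hypothetical.

```python
# Minimal sketch, assuming the OpenAI Python client; not the authors' procedure.
import re
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_question(stem: str, options: dict[str, str], model: str = "gpt-4") -> dict:
    """Send one multiple-choice question and log the fields the study tracked."""
    prompt = stem + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    start = time.monotonic()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.monotonic() - start
    answer_text = response.choices[0].message.content
    # Crude first pass at the selected letter; ambiguous responses still need
    # the manual review described in the Methods.
    match = re.search(r"\b([A-E])\b", answer_text)
    return {
        "selected_answer": match.group(1) if match else None,
        "response_length": len(answer_text),
        "response_time_s": round(elapsed, 2),
        "full_response": answer_text,
    }
```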

RESULTS

GPT-4 performed significantly better compared to GPT-3.5 (difference 25.0%, 95% CI 16.3%-32.8%; McNemar test: P<.001); it correctly answered 89/108 (82.4%) questions, while GPT-3.5 answered 62/108 (57.4%) questions correctly. Further, GPT-4 scored higher across all 11 categories of Family Medicine knowledge. In 86.1% (n=93) of the responses, GPT-4 provided a rationale for why other multiple-choice options were not chosen compared to the 16.7% (n=18) achieved by GPT-3.5. Qualitatively, for both GPT-3.5 and GPT-4 responses, logical errors were the most common, while arithmetic errors were the least common. The average performance of Family Medicine residents was 56.9% (95% CI 56.2%-57.6%). The performance of GPT-3.5 was similar to that of the average Family Medicine resident (P=.16), while the performance of GPT-4 exceeded that of the top-performing Family Medicine resident (P<.001).
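
For readers who want to reproduce this kind of paired comparison, the sketch below runs McNemar's exact test with statsmodels. Only the marginal accuracies (GPT-4: 89/108 correct; GPT-3.5: 62/108) come from the paper; the split of the 2x2 table into concordant and discordant cells is a made-up example, since the abstract does not report those counts.

```python
# Minimal sketch of the paired comparison; the 2x2 split is hypothetical.
from statsmodels.stats.contingency_tables import mcnemar

n = 108
# Rows: GPT-4 (correct, incorrect); columns: GPT-3.5 (correct, incorrect).
table = [[60, 29],  # both correct / GPT-4 only correct   (hypothetical split)
         [2, 17]]   # GPT-3.5 only correct / both incorrect

gpt4_acc = (table[0][0] + table[0][1]) / n   # 89/108 = 82.4%
gpt35_acc = (table[0][0] + table[1][0]) / n  # 62/108 = 57.4%
result = mcnemar(table, exact=True)  # exact binomial test on the 31 discordant pairs

print(f"difference = {gpt4_acc - gpt35_acc:.1%}")  # 25.0%, as reported
print(f"P = {result.pvalue:.2g}")  # far below .001 for any split this lopsided
```

The paper's 95% CI for the difference (16.3%-32.8%) uses an interval method for matched proportions, which the sketch does not reproduce.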

CONCLUSIONS

GPT-4 significantly outperforms both GPT-3.5 and Family Medicine residents on a multiple-choice medical knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its response choice, ruling out other answer choices efficiently and with concise justification. Its high degree of accuracy and advanced reasoning capabilities facilitate its potential applications in medical education, including the creation of exam questions and scenarios as well as serving as a resource for medical knowledge or information on community services.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/91a2/10548315/fa542e03d785/mededu_v9i1e50514_fig1.jpg
