Huang Ryan St, Lu Kevin Jia Qi, Meaney Christopher, Kemppainen Joel, Punnett Angela, Leung Fok-Han
Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada.
Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada.
JMIR Med Educ. 2023 Sep 19;9:e50514. doi: 10.2196/50514.
Large language model (LLM)-based chatbots are evolving at an unprecedented pace with the release of ChatGPT, specifically GPT-3.5, and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have advanced to the point that they perform strongly on various educational examination benchmarks, including medical knowledge tests. Comparing the performance of these 2 LLMs with that of Family Medicine residents on a multiple-choice medical knowledge test can provide insight into their potential as medical education tools.
This study aimed to quantitatively and qualitatively compare the performance of GPT-3.5, GPT-4, and Family Medicine residents in a multiple-choice medical knowledge test appropriate for the level of a Family Medicine resident.
An official University of Toronto Department of Family and Community Medicine Progress Test consisting of multiple-choice questions was entered into GPT-3.5 and GPT-4. The artificial intelligence chatbots' responses were manually reviewed to determine the selected answer, response length, response time, whether a rationale was provided for the outputted response, and the root cause of each incorrect response (classified as an arithmetic, logical, or information error). The performance of the artificial intelligence chatbots was compared against that of a cohort of Family Medicine residents who concurrently attempted the test.
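To make the manual review workflow concrete, the sketch below shows one way the per-question fields described above could be organized before analysis. The record layout, field names, and error taxonomy labels are assumptions for illustration, not the study's actual coding instrument.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Hypothetical error taxonomy mirroring the categories named in the methods
# (arithmetic, logical, information); labels are illustrative only.
class ErrorType(Enum):
    ARITHMETIC = "arithmetic"
    LOGICAL = "logical"
    INFORMATION = "information"

@dataclass
class ChatbotResponseRecord:
    """One manually reviewed chatbot answer to a single test question."""
    question_id: int
    model: str                                # e.g., "gpt-3.5" or "gpt-4"
    selected_answer: str                      # multiple-choice option chosen
    correct_answer: str
    response_length_words: int
    response_time_seconds: float
    gave_rationale: bool                      # justified rejecting other options?
    error_type: Optional[ErrorType] = None    # set only when the answer is wrong

    @property
    def is_correct(self) -> bool:
        # Correctness is derived by comparing the chosen option to the key.
        return self.selected_answer == self.correct_answer
```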
GPT-4 performed significantly better than GPT-3.5 (difference 25.0%, 95% CI 16.3%-32.8%; McNemar test: P<.001); it correctly answered 89/108 (82.4%) questions, while GPT-3.5 correctly answered 62/108 (57.4%). Further, GPT-4 scored higher across all 11 categories of Family Medicine knowledge. GPT-4 provided a rationale for why the other multiple-choice options were not chosen in 86.1% (n=93) of its responses, compared with 16.7% (n=18) for GPT-3.5. Qualitatively, for both GPT-3.5 and GPT-4, logical errors were the most common and arithmetic errors the least common. The average performance of Family Medicine residents was 56.9% (95% CI 56.2%-57.6%). The performance of GPT-3.5 was similar to that of the average Family Medicine resident (P=.16), while the performance of GPT-4 exceeded that of the top-performing Family Medicine resident (P<.001).
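The paired comparison reported above can be illustrated with a minimal sketch, assuming per-question correctness for each model is available as two boolean sequences aligned on the 108 test items. The function name and inputs are assumptions for illustration; only the McNemar test itself corresponds to the analysis named in the abstract, and the exact binomial form is chosen here because the number of discordant question pairs on a 108-item test may be small.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_paired_accuracy(correct_a, correct_b):
    """McNemar test on paired right/wrong outcomes from the same questions.

    correct_a, correct_b: boolean sequences of equal length, one entry per
    question (e.g., GPT-4 vs GPT-3.5 on the 108 Progress Test items).
    Returns the difference in proportion correct, the exact McNemar p-value,
    and the 2x2 table of paired outcomes.
    """
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)

    # 2x2 contingency table: rows = model A correct/incorrect,
    # columns = model B correct/incorrect.
    table = np.array([
        [np.sum(a & b),  np.sum(a & ~b)],
        [np.sum(~a & b), np.sum(~a & ~b)],
    ])

    # Exact binomial McNemar test on the discordant pairs.
    result = mcnemar(table, exact=True)
    diff = a.mean() - b.mean()  # difference in proportion correct
    return diff, result.pvalue, table
```

A confidence interval for the difference in proportion correct, such as the 95% CI reported above, would be derived from the same paired question-level data (for example, by resampling questions); the specific interval method used in the study is not stated in the abstract.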
GPT-4 significantly outperforms both GPT-3.5 and Family Medicine residents on a multiple-choice medical knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its answer choice, ruling out the other options efficiently and with concise justification. Its high degree of accuracy and advanced reasoning capabilities support potential applications in medical education, including the creation of examination questions and scenarios, as well as serving as a resource for medical knowledge or information on community services.