Department of Plastic, Hand and Reconstructive Surgery, University Hospital Regensburg, Regensburg, Germany.
Division of Hand, Plastic and Aesthetic Surgery, Ludwig-Maximilians University Munich, Munich, Germany.
JMIR Med Educ. 2024 Jan 5;10:e51148. doi: 10.2196/51148.
The United States Medical Licensing Examination (USMLE) has been a cornerstone of medical education since 1992, testing a medical student's knowledge and skills through a series of steps matched to their level of training. Artificial intelligence (AI) tools, including chatbots such as ChatGPT, are emerging technologies with potential applications in medicine. However, comprehensive studies analyzing ChatGPT's performance on USMLE Step 3 at scale, and comparing different versions of ChatGPT, are limited.
This paper aimed to analyze ChatGPT's performance on USMLE Step 3 practice test questions to better elucidate the strengths and weaknesses of AI in medical education and to derive evidence-based strategies to counteract AI-assisted cheating.
A total of 2069 USMLE Step 3 practice questions were extracted from the AMBOSS study platform. After excluding 229 image-based questions, the remaining 1840 text-based questions were categorized and entered into ChatGPT 3.5, while a subset of 229 questions was also entered into ChatGPT 4. Responses were recorded, and the accuracy of ChatGPT's answers, as well as its performance across test question categories and difficulty levels, was compared between the two versions.
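The abstract does not describe the tooling used to enter questions and record responses; the following is a minimal Python sketch of how such a pipeline could be scripted against the OpenAI API, assuming the questions are stored in a CSV file. The file name, column names, and prompt format are hypothetical, not taken from the study.

# Minimal sketch of an automated question-entry pipeline; the study itself
# does not specify its tooling, so file names, column names, and the prompt
# are all assumptions for illustration.
import csv

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question_text: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one multiple-choice question and return the raw model reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question_text}],
    )
    return response.choices[0].message.content

with open("amboss_step3_questions.csv", newline="", encoding="utf-8") as f, \
        open("responses.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["question", "model_answer", "correct_answer"])
    for row in csv.DictReader(f):
        answer = ask(row["question"])
        writer.writerow([row["question"], answer, row["correct_answer"]])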
Overall, ChatGPT 4 demonstrated statistically significantly superior performance compared with ChatGPT 3.5, achieving accuracies of 84.7% (194/229) and 56.9% (1047/1840), respectively. A weak but statistically significant negative correlation was observed between test question length and the performance of ChatGPT 3.5 (ρ=-0.069; P=.003); no such correlation was found for ChatGPT 4 (P=.87). Additionally, test question difficulty, as categorized by AMBOSS hammer ratings, correlated significantly and negatively with performance for both ChatGPT versions, with ρ=-0.289 for ChatGPT 3.5 and ρ=-0.344 for ChatGPT 4. ChatGPT 4 surpassed ChatGPT 3.5 at every level of test question difficulty except the 2 highest tiers (4 and 5 hammers), where the difference did not reach statistical significance.
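For illustration, the kinds of statistics reported here (Spearman rank correlations and a comparison of two accuracy proportions) can be computed on per-question data as sketched below, assuming each question carries a 0/1 correctness flag, a length, and a hammer rating. The arrays are random stand-ins, not the study's data; the two-proportion z-test is one plausible choice of test and is not confirmed as the authors' method.

# Illustrative reconstruction of the reported statistics on hypothetical
# per-question records; only the accuracy counts (194/229, 1047/1840) come
# from the abstract itself.
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=1840)      # 1 = answered correctly (stand-in)
length = rng.integers(200, 2000, size=1840)  # question length in characters
hammers = rng.integers(1, 6, size=1840)      # AMBOSS difficulty rating, 1-5

# Spearman rank correlation between question length and correctness
rho_len, p_len = spearmanr(length, correct)
# Spearman rank correlation between difficulty rating and correctness
rho_diff, p_diff = spearmanr(hammers, correct)
print(f"length vs. accuracy: rho={rho_len:.3f}, P={p_len:.3f}")
print(f"difficulty vs. accuracy: rho={rho_diff:.3f}, P={p_diff:.3f}")

# Two-proportion z-test comparing the overall accuracies reported in the
# abstract: ChatGPT 4 (194/229) vs. ChatGPT 3.5 (1047/1840)
stat, p_val = proportions_ztest(count=[194, 1047], nobs=[229, 1840])
print(f"GPT-4 vs. GPT-3.5 accuracy: z={stat:.2f}, P={p_val:.4g}")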
In this study, ChatGPT 4 demonstrated remarkable proficiency on USMLE Step 3 practice questions, with an accuracy of 84.7% (194/229), outperforming ChatGPT 3.5, which achieved 56.9% (1047/1840). Although ChatGPT 4 performed strongly overall, it struggled with questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for developing examination strategies that are resilient to AI and underline the promising role of AI in medical education and diagnostics.