Alfertshofer Michael, Knoedler Samuel, Hoch Cosima C, Cotofana Sebastian, Panayi Adriana C, Kauke-Navarro Martin, Tullius Stefan G, Orgill Dennis P, Austen William G, Pomahac Bohdan, Knoedler Leonard
Department of Oral and Maxillofacial Surgery, Ludwig-Maximilians-University Munich, Munich, Germany.
Department of Plastic Surgery and Hand Surgery, Klinikum Rechts der Isar, Technical University of Munich, Munich, Germany.
Med Sci Educ. 2024 Sep 28;35(1):257-267. doi: 10.1007/s40670-024-02176-9. eCollection 2025 Feb.
The potential of artificial intelligence (AI) and large language models such as ChatGPT in medical applications is promising, yet their performance requires comprehensive evaluation. This study assessed ChatGPT's ability to answer USMLE® Step 2CK questions, analyzing its performance across medical specialties, question types, and difficulty levels in a large-scale question test set. The aim was to help question writers develop AI-resistant exam questions and to give medical students a realistic understanding of how AI can enhance their active learning.
A total of 3302 USMLE® Step 2CK practice questions were extracted from the AMBOSS© study platform; 302 image-based questions were excluded, leaving 3000 text-based questions for analysis. Questions were manually entered into ChatGPT, and its accuracy and performance across the various categories and difficulty levels were evaluated.
ChatGPT answered 57.7% of all questions correctly. The highest performance was found in the category "Male Reproductive System" (71.7%) and the lowest in the category "Immune System" (46.3%). Performance was lower on table-based questions, and a negative correlation was found between question difficulty and performance (r = -0.285, p < 0.001). Longer questions tended to be answered incorrectly more often (r = -0.076, p < 0.001), with a significant difference in length between correctly and incorrectly answered questions.
ChatGPT demonstrated proficiency close to the passing threshold for USMLE® Step 2CK. Performance varied by category, question type, and difficulty. These findings can help medical educators make their exams more AI-proof and inform the integration of AI tools like ChatGPT into teaching strategies. For students, understanding the model's limitations and capabilities helps ensure it is used as an auxiliary resource to foster active learning rather than misused as a study replacement. This study highlights the need for further refinement and improvement of AI models for medical education and decision-making.