Roos Jonas, Kasapovic Adnan, Jansen Tom, Kaczmarczyk Robert
Department of Orthopedics and Trauma Surgery, University Hospital of Bonn, Bonn, Germany.
Department of Dermatology and Allergy, Technical University of Munich, Munich, Germany.
JMIR Med Educ. 2023 Sep 4;9:e46482. doi: 10.2196/46482.
Large language models (LLMs) have demonstrated significant potential in diverse domains, including medicine. Nonetheless, there is a scarcity of studies examining their performance in medical examinations, especially those conducted in languages other than English, and in direct comparison with medical students. Analyzing the performance of LLMs in state medical examinations can provide insights into their capabilities and limitations and evaluate their potential role in medical education and examination preparation.
This study aimed to assess and compare the performance of 3 LLMs (GPT-4, Bing, and GPT-3.5-Turbo) on the German Medical State Examinations of 2022 and to evaluate their performance relative to that of medical students.
The LLMs were assessed on a total of 630 questions from the spring and fall German Medical State Examinations of 2022. The performance was evaluated with and without media-related questions. Statistical analyses included 1-way ANOVA and independent samples t tests for pairwise comparisons. The relative strength of the LLMs in comparison with that of the students was also evaluated.
GPT-4 achieved the highest overall performance, correctly answering 88.1% of questions, closely followed by Bing (86.0%) and GPT-3.5-Turbo (65.7%). The students had an average correct answer rate of 74.6%. Both GPT-4 and Bing significantly outperformed the students in both examinations. When media-related questions were excluded, Bing achieved the highest performance (90.7%), closely followed by GPT-4 (90.4%), while GPT-3.5-Turbo lagged behind (68.2%). The performance of GPT-4 and Bing declined significantly in the fall 2022 examination compared with the spring examination, which was attributed to a higher proportion of media-related questions and a potential increase in question difficulty.
LLMs, particularly GPT-4 and Bing, demonstrate potential as valuable tools in medical education and for pretesting examination questions. Their high performance, even relative to that of medical students, indicates promising avenues for further development and integration into the educational and clinical landscape.