Bolgova Olena, Shypilova Inna, Mavrych Volodymyr
College of Medicine, Alfaisal University, Al Takhassousi St, Riyadh, 11533, Saudi Arabia.
School of Medicine, St Matthews University, George Town, Cayman Islands.
JMIR Med Educ. 2025 Apr 10;11:e67244. doi: 10.2196/67244.
Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs), have ushered in a new era of innovation across many fields, with medicine at the forefront of this technological revolution. Many studies have indicated that, at their current level of development, LLMs can pass various board examinations. However, their ability to answer subject-specific questions requires validation.
The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots (Claude by Anthropic, GPT-4 by OpenAI, Gemini by Google, and Copilot by Microsoft) against the academic results of medical students in a medical biochemistry course.
We used 200 USMLE (United States Medical Licensing Examination)-style multiple-choice questions (MCQs) selected from the course exam database. They spanned various complexity levels and were distributed across 23 distinct topics; questions containing tables or images were excluded. In August 2024, each of Claude 3.5 Sonnet, GPT-4-1106, Gemini 1.5 Flash, and Copilot made 5 successive attempts at the question set, and the results were evaluated for accuracy. Statistica 13.5.0.17 (TIBCO Software Inc) was used for basic descriptive statistics. Given the binary nature of the data, the chi-square test was used to compare results among the chatbots, with statistical significance set at P<.05.
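As a point of reference, the short Python sketch below shows how the accuracy scoring described here could be computed; it is a minimal illustration, and the names (attempt_accuracy, model_accuracy, answer_key) are hypothetical rather than taken from the study.

# Minimal sketch of the scoring step: each chatbot answers the
# 200-question MCQ set 5 times, and accuracy is the share of
# correct responses. Names are illustrative only.
from statistics import mean

def attempt_accuracy(responses, answer_key):
    """Fraction of MCQs answered correctly in a single attempt."""
    correct = sum(r == k for r, k in zip(responses, answer_key))
    return correct / len(answer_key)

def model_accuracy(attempts, answer_key):
    """Mean accuracy over successive attempts (5 in the study)."""
    return mean(attempt_accuracy(a, answer_key) for a in attempts)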
On average, the selected chatbots correctly answered 81.1% (SD 12.8%) of the questions, surpassing the students' performance by 8.3% (P=.02). Claude showed the best performance on the biochemistry MCQs, correctly answering 92.5% (185/200) of questions, followed by GPT-4 (170/200, 85%), Gemini (157/200, 78.5%), and Copilot (128/200, 64%). The chatbots performed best on the following 4 topics: eicosanoids (mean 100%, SD 0%), bioenergetics and electron transport chain (mean 96.4%, SD 7.2%), hexose monophosphate pathway (mean 91.7%, SD 16.7%), and ketone bodies (mean 93.8%, SD 12.5%). The Pearson chi-square test indicated a statistically significant association between the answers of all 4 chatbots (P<.001 to P<.04).
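For a concrete sense of the chi-square comparison, the sketch below tabulates the correct/incorrect counts reported above and tests them with scipy. Treating the comparison as a single 4x2 contingency table of pooled accuracies is an assumption; the study may instead have tested question-level agreement between chatbot pairs.

# Chi-square comparison of the reported correct/incorrect counts.
# The 4x2 contingency design is an assumed reconstruction, not
# necessarily the exact analysis performed in the study.
from scipy.stats import chi2_contingency

counts = {                     # (correct, incorrect) out of 200 MCQs
    "Claude":  (185, 15),
    "GPT-4":   (170, 30),
    "Gemini":  (157, 43),
    "Copilot": (128, 72),
}

table = [list(row) for row in counts.values()]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, P={p:.3g}")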
Our study suggests that different AI models may have distinct strengths in specific medical fields, which could be leveraged for targeted support in biochemistry courses. These findings highlight the potential of AI in medical education and assessment.