Flores-Cohaila Javier A, García-Vicente Abigaíl, Vizcarra-Jiménez Sonia F, De la Cruz-Galán Janith P, Gutiérrez-Arratia Jesús D, Quiroga Torres Blanca Geraldine, Taype-Rondan Alvaro
Academic Department, USAMEDIC, Lima, Peru.
Facultad de Ciencias de la Salud, Carrera de Medicina, Universidad Científica del Sur, Lima, Peru.
JMIR Med Educ. 2023 Sep 28;9:e48039. doi: 10.2196/48039.
ChatGPT has shown impressive performance on national medical licensing examinations, such as the United States Medical Licensing Examination (USMLE), which it has passed at an expert level. However, there is little research on its performance on national medical licensing examinations in low-income countries. In Peru, where almost one in three examinees fails the national medical licensing examination, ChatGPT has the potential to enhance medical education.
We aimed to assess the accuracy of ChatGPT using GPT-3.5 and GPT-4 on the Peruvian National Licensing Medical Examination (Examen Nacional de Medicina [ENAM]). Additionally, we sought to identify factors associated with incorrect answers provided by ChatGPT.
We used the ENAM 2022 data set, which consisted of 180 multiple-choice questions, to evaluate the performance of ChatGPT. Questions were submitted using various prompts, and accuracy was evaluated. The performance of ChatGPT was compared with that of a sample of 1025 examinees. Factors such as question type, Peruvian-specific knowledge, discrimination, difficulty, question quality, and subject were analyzed to determine their influence on incorrect answers. Questions that received incorrect answers underwent a three-step process involving different prompts to explore the potential impact of adding roles and context on ChatGPT's accuracy.
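The evaluation described above can be pictured as a simple scoring loop over the 180 items. The sketch below is a minimal illustration only, assuming each item is stored as a stem, a dictionary of lettered options, and an answer key; the prompt wording, the letter-parsing rule, and the `ask` and `accuracy` helpers are hypothetical and are not the authors' protocol.

```python
# Hypothetical sketch: scoring a set of multiple-choice items against a chat model.
# Data format, prompt wording, and parsing rule are assumptions for illustration.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(model: str, stem: str, options: dict[str, str]) -> str:
    """Send one ENAM-style item and return the letter the model picks."""
    prompt = (
        stem
        + "\n"
        + "\n".join(f"{letter}) {text}" for letter, text in options.items())
        + "\nAnswer with a single letter."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content
    match = re.search(r"\b([A-E])\b", reply.upper())
    return match.group(1) if match else ""

def accuracy(model: str, items: list[dict]) -> float:
    """Fraction of items answered with the keyed option."""
    correct = sum(ask(model, it["stem"], it["options"]) == it["key"] for it in items)
    return correct / len(items)
```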
GPT-4 achieved an accuracy of 86% on the ENAM, followed by GPT-3.5 with 77%. The accuracy obtained by the 1025 examinees was 55%. There was fair agreement (κ=0.38) between GPT-3.5 and GPT-4. Moderate-to-high-difficulty questions were associated with incorrect answers in both the crude and adjusted models for GPT-3.5 (odds ratio [OR] 6.6, 95% CI 2.73-15.95) and GPT-4 (OR 33.23, 95% CI 4.3-257.12). After reinputting questions that received incorrect answers, GPT-3.5 went from 41 (100%) to 12 (29%) incorrect answers, and GPT-4 from 25 (100%) to 4 (16%).
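The two statistics reported here can be reproduced with standard tools: Cohen's kappa for agreement between the two models and a logistic regression whose exponentiated coefficients are odds ratios for incorrect answers on difficult items. The sketch below uses small, made-up response vectors purely to show the calculation; it does not reproduce the study data.

```python
# Hypothetical sketch of the reported statistics on toy data (1 = correct, 0 = incorrect).
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import cohen_kappa_score

gpt35 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0])
gpt4  = np.array([1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1])

# Agreement between the two models; 0.21-0.40 is conventionally read as "fair".
kappa = cohen_kappa_score(gpt35, gpt4)

# Logistic regression of "incorrect answer" on a difficulty indicator;
# exponentiating the coefficient gives the odds ratio, as reported above.
incorrect = 1 - gpt35
hard = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0])  # 1 = moderate-to-high difficulty
X = sm.add_constant(hard)
fit = sm.Logit(incorrect, X).fit(disp=0)
odds_ratios = np.exp(fit.params)     # OR for the difficulty term
conf_int = np.exp(fit.conf_int())    # 95% CI on the OR scale
```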
Our study found that ChatGPT (GPT-3.5 and GPT-4) can achieve expert-level performance on the ENAM, outperforming most of our examinees. We found fair agreement between GPT-3.5 and GPT-4. Incorrect answers were associated with question difficulty, a pattern that may resemble human performance. Furthermore, by reinputting questions that initially received incorrect answers with different prompts containing additional roles and context, ChatGPT achieved improved accuracy.
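The three-step reprompting of initially incorrect items can be illustrated as progressively enriched prompts: question only, then a role added, then role plus context. The wording below is an assumption for illustration; the study's actual prompts are not reproduced in this abstract.

```python
# Hypothetical illustration of the three-step reprompting for initially incorrect items.
def build_prompts(question: str) -> list[str]:
    role = "You are a physician taking the Peruvian national licensing exam (ENAM)."
    context = "Answer according to Peruvian clinical guidelines and epidemiology."
    return [
        question,                            # step 1: question only
        f"{role}\n\n{question}",             # step 2: role + question
        f"{role}\n{context}\n\n{question}",  # step 3: role + context + question
    ]
```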