Tekin Ayla, Karamus Nizameddin Fatih, Çolak Tuncay
Faculty of Medicine, Anatomy Department, Kocaeli University, Umuttepe Campus, Kocaeli, 41001, Turkey.
Faculty of Medicine, Anatomy Department, Altinbas University, İstanbul, Turkey.
Surg Radiol Anat. 2025 Jun 10;47(1):158. doi: 10.1007/s00276-025-03667-z.
The study aimed to evaluate the effectiveness of anatomy multiple-choice questions (MCQs) generated by GPT-4, focusing on their methodological appropriateness and their alignment with the cognitive levels defined by Bloom's revised taxonomy, with the goal of enhancing assessment.
The assessment questions for medical students were created with GPT-4 and comprised 240 MCQs organized into subcategories aligned with Bloom's revised taxonomy. The prompts used to generate the MCQs included details about the lesson's purpose, the learning objectives, and the students' prior experience, so that the questions would be contextually appropriate. Thirty MCQs were randomly selected from the generated pool for testing. A total of 280 students took the examination, from which the difficulty index of each MCQ, the item discrimination index, and the overall test difficulty were calculated. Expert anatomists evaluated the accuracy of GPT-4's taxonomy classifications.
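For illustration, a minimal sketch of how such context-rich prompts might be assembled follows. The paper does not publish its exact prompts, so every field name and instruction below is an assumption, not the authors' wording.

# Hypothetical prompt builder (Python): illustrative only; the study's
# actual prompt structure and wording are not published.
def build_mcq_prompt(lesson_purpose: str, learning_objective: str,
                     prior_experience: str, bloom_level: str) -> str:
    """Assemble a GPT-4 prompt embedding the lesson context the study
    describes, targeting one Bloom's revised taxonomy level."""
    return (
        "You are writing a single-best-answer anatomy MCQ for medical students.\n"
        f"Lesson purpose: {lesson_purpose}\n"
        f"Learning objective: {learning_objective}\n"
        f"Students' prior experience: {prior_experience}\n"
        f"Target cognitive level (Bloom's revised taxonomy): {bloom_level}\n"
        "Provide one stem, five options (A-E), and indicate the correct answer."
    )

# Example call with illustrative values:
print(build_mcq_prompt("Upper limb osteology", "Identify the carpal bones",
                       "Completed first-year gross anatomy lectures", "Remember"))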
Students achieved a median score of 50 points (range, 36.67-60). The test's internal consistency, assessed with KR-20, was 0.737, and its average difficulty was 0.5012. Difficulty and discrimination indices were calculated for each AI-generated question. Expert anatomists' taxonomy-based classifications matched GPT-4's intended levels in 26.6% of cases. In addition, 80.9% of students found the questions clear, and 85.8% expressed interest in retaking the assessment.
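For readers unfamiliar with these indices, the sketch below shows one standard way to compute item difficulty, an upper-lower 27% discrimination index, and KR-20 from a binary response matrix. This is a generic classical-test-theory illustration, not the authors' analysis code; the 27% grouping and the use of sample variance are conventional assumptions.

import numpy as np

def item_analysis(responses: np.ndarray) -> dict:
    """Classical test theory indices for a 0/1 response matrix
    (rows = students, columns = items)."""
    n_students, k = responses.shape
    totals = responses.sum(axis=1)          # each student's total score

    # Item difficulty: proportion of students answering each item correctly.
    p = responses.mean(axis=0)
    q = 1.0 - p

    # Discrimination: mean difference between the top and bottom 27% of
    # students ranked by total score (the upper-lower group method).
    order = np.argsort(totals)
    n_group = max(1, int(round(0.27 * n_students)))
    discrimination = (responses[order[-n_group:]].mean(axis=0)
                      - responses[order[:n_group]].mean(axis=0))

    # KR-20 internal consistency: (k/(k-1)) * (1 - sum(p*q) / var(totals)).
    kr20 = (k / (k - 1)) * (1.0 - p.dot(q) / totals.var(ddof=1))

    return {"difficulty": p, "discrimination": discrimination, "kr20": kr20}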
This study demonstrates GPT-4's considerable potential for generating medical education exam questions. While it effectively assesses basic knowledge recall, it does not sufficiently evaluate the higher-order cognitive processes outlined in Bloom's revised taxonomy. Future research should explore approaches that combine AI with expert evaluation and specialized multimodal models.