Kıyak Yavuz Selim, Soylu Ayşe, Coşkun Özlem, Budakoğlu Işıl İrem, Peker Tuncay Veysel
Department of Medical Education and Informatics, Faculty of Medicine, Gazi University, Ankara, Turkey.
Department of Anatomy, Faculty of Medicine, Gazi University, Ankara, Turkey.
Clin Anat. 2025 May;38(4):505-510. doi: 10.1002/ca.24271. Epub 2025 Mar 24.
Developing high-quality multiple-choice questions (MCQs) for medical school exams is effortful and time-consuming. In this study, we investigated the ability of ChatGPT to generate case-based anatomy MCQs with acceptable levels of item difficulty and discrimination for medical school exams. Following a framework for artificial intelligence (AI)-assisted item generation, we used ChatGPT to generate case-based anatomy MCQs for an endocrine and urogenital system exam. The questions were evaluated by experts, approved by the department, and administered to 502 second-year medical students (372 in the Turkish-language program, 130 in the English-language program). The items were analyzed to determine their discrimination and difficulty indices. The item discrimination indices ranged from 0.29 to 0.54, indicating acceptable differentiation between high- and low-performing students. All six Turkish-language items and five of the six English-language items met the stricter discrimination threshold (≥0.30) required for large-scale standardized tests. The item difficulty indices ranged from 0.41 to 0.89, with most items falling within the moderate difficulty range (0.20-0.80). We therefore conclude that ChatGPT can generate case-based anatomy MCQs with acceptable psychometric properties, offering a promising tool for medical educators. However, human expertise remains crucial for reviewing and refining AI-generated assessment items. Future research should explore AI-generated MCQs across various anatomy topics and investigate different AI models for question generation.
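For readers unfamiliar with the indices reported above, the following is a minimal sketch of how item difficulty and discrimination are typically computed under classical test theory, assuming the common upper/lower 27% group method for the discrimination index. The abstract does not specify the exact analysis procedure used in the study, and the function name and simulated data below are purely illustrative.

```python
import numpy as np

def item_analysis(responses: np.ndarray, group_frac: float = 0.27):
    """Classical item analysis.

    responses: binary matrix (students x items), 1 = correct answer.
    Returns per-item difficulty (proportion correct) and
    discrimination (upper-group minus lower-group proportion correct).
    """
    n_students, _ = responses.shape
    total = responses.sum(axis=1)                  # each student's raw score
    order = np.argsort(total)                      # low scorers first
    k = max(1, int(round(group_frac * n_students)))
    lower, upper = order[:k], order[-k:]

    difficulty = responses.mean(axis=0)            # p: 0 (hard) .. 1 (easy)
    discrimination = (responses[upper].mean(axis=0)
                      - responses[lower].mean(axis=0))  # D index
    return difficulty, discrimination

# Illustrative example: 502 simulated students, 6 items (not the study data)
rng = np.random.default_rng(0)
resp = (rng.random((502, 6)) < rng.uniform(0.4, 0.9, 6)).astype(int)
p, d = item_analysis(resp)
print("difficulty:", p.round(2))       # moderate range is roughly 0.20-0.80
print("discrimination:", d.round(2))   # >= 0.30 often deemed acceptable
```

Under this convention, a difficulty index near 0.89 marks an easy item, and a discrimination index at or above 0.30 is the threshold the abstract cites for large-scale standardized tests.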