Emekli Emre, Karahan Betül Nalan
Department of Radiology, Eskişehir Osmangazi University, Faculty of Medicine, Eskişehir, Türkiye.
J Med Imaging Radiat Sci. 2025 Mar 28;56(4):101896. doi: 10.1016/j.jmir.2025.101896.
High-quality multiple-choice questions (MCQs) are essential for effective student assessment in health education. However, the manual creation of MCQs is labour-intensive, requiring significant time and expertise. With the increasing demand for large and continuously updated question banks, artificial intelligence (AI), particularly large language models (LLMs) like ChatGPT, has emerged as a potential tool for automating question generation. While AI-assisted question generation has shown promise, its ability to match human-authored MCQs in terms of difficulty and discrimination indices remains unclear. This study aims to compare the effectiveness of AI-generated and faculty-authored MCQs in radiography education, addressing a critical gap in evaluating AI's role in assessment processes. The findings will be beneficial for educators and curriculum designers exploring AI integration into health education.
This study was conducted in Turkey during the 2024-2025 academic year. Participants included 56 students enrolled in the first year of the Medical Imaging Programme. Two separate 30-question MCQ exams were developed: one generated by ChatGPT-4o and the other written by a faculty member. The questions were derived from radiographic anatomy and positioning content, covering topics such as cranial, vertebral, pelvic, and lower extremity radiographs. Each exam contained six questions per topic, categorised as easy, medium, or difficult. A quantitative research design was employed. Students took both exams on separate days, without knowing the source of the questions. Difficulty and discrimination indices were calculated for each question, and student feedback was collected using a 5-point Likert scale to evaluate their perceptions of the exams.
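The abstract does not give the exact item-analysis formulas used. A minimal Python sketch under the standard classical-test-theory definitions (difficulty = proportion of correct responses; discrimination = difference in proportion correct between upper and lower score groups, here a 27% split) might look like the following; the function name, the grouping fraction, and the 0/1 response-matrix layout are assumptions for illustration, not details taken from the study.

```python
import numpy as np

def item_analysis(responses, group_fraction=0.27):
    """Classical item analysis for a 0/1 response matrix.

    responses: (n_students, n_items) array, 1 = correct, 0 = incorrect.
    Returns per-item difficulty (proportion correct) and discrimination
    (upper-group minus lower-group proportion correct) using the common
    upper/lower 27% split. These are standard definitions assumed here;
    the study's exact procedure is not stated in the abstract.
    """
    responses = np.asarray(responses)
    n_students, _ = responses.shape

    # Difficulty index p: proportion of students answering each item correctly.
    difficulty = responses.mean(axis=0)

    # Rank students by total score and take the top and bottom 27% groups.
    totals = responses.sum(axis=1)
    order = np.argsort(totals)
    k = max(1, int(round(group_fraction * n_students)))
    lower, upper = responses[order[:k]], responses[order[-k:]]

    # Discrimination index D: difference in proportion correct between groups.
    discrimination = upper.mean(axis=0) - lower.mean(axis=0)
    return difficulty, discrimination
```

Under common rules of thumb, items with a discrimination index of roughly 0.20 or higher are often treated as acceptable; the abstract does not state which cut-off was applied to classify the questions reported below.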
A total of 56 of the 80 eligible students participated, a response rate of 70%. The mean number of correct answers was similar for the ChatGPT exam (14.91 ± 4.25) and the human expert exam (15.82 ± 4.73; p = 0.089). Exam scores showed a moderate positive correlation (r = 0.628, p < 0.001). ChatGPT questions had an average difficulty index of 0.50 versus 0.53 for human expert questions. Discrimination indices were acceptable for 73.33% of ChatGPT questions and 86.67% of human expert questions.
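The reported score comparison and correlation can be reproduced with standard statistical tooling. The sketch below assumes a paired-samples t-test and Pearson's r, since the abstract does not name the specific tests used; the function and its return structure are illustrative, not taken from the paper.

```python
import numpy as np
from scipy import stats

def compare_exams(chatgpt_scores, faculty_scores):
    """Paired comparison of two score vectors (one entry per student).

    Assumes a paired-samples t-test and Pearson correlation; the abstract
    reports p = 0.089 and r = 0.628 but does not name the tests applied.
    """
    chatgpt_scores = np.asarray(chatgpt_scores, dtype=float)
    faculty_scores = np.asarray(faculty_scores, dtype=float)

    # Paired test on per-student scores from the two exams.
    t_stat, p_value = stats.ttest_rel(chatgpt_scores, faculty_scores)

    # Correlation between the two sets of exam scores.
    r, r_p = stats.pearsonr(chatgpt_scores, faculty_scores)

    return {
        "mean_chatgpt": chatgpt_scores.mean(),
        "mean_faculty": faculty_scores.mean(),
        "paired_t_p": p_value,
        "pearson_r": r,
        "pearson_p": r_p,
    }
```

With per-student score vectors for both exams, this returns the group means, the paired-test p-value, and the correlation in one pass; a non-parametric alternative such as scipy.stats.wilcoxon could be substituted if the score distributions are skewed.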
LLMs such as ChatGPT can generate MCQs of quality comparable to those written by human experts, though slight limitations in discrimination and difficulty alignment remain. These models hold promise for supplementing assessment processes in health education.