Birmingham Medical School, University of Birmingham, Birmingham, UK.
Imperial College School of Medicine, Imperial College London, London, UK.
BMC Med Educ. 2023 Sep 11;23(1):659. doi: 10.1186/s12909-023-04457-0.
Automated Item Generation (AIG) uses computer software to create multiple items from a single question model. There is currently little data on whether item variants generated from a single question model lead to differences in student performance or in human-derived standard setting. The purpose of this study was to use 50 Multiple Choice Questions (MCQs) as models to create four distinct tests, which were standard set and delivered to final-year UK medical students, and then to compare the performance and standard-setting data for each.
Pre-existing questions from the UK Medical Schools Council (MSC) Assessment Alliance item bank, originally written using traditional item-writing techniques, were used to generate four 'isomorphic' 50-item MCQ tests with AIG software. Isomorphic questions use the same question template with minor alterations to test the same learning outcome. All UK medical schools were invited to deliver one of the four papers as an online formative assessment for their final-year students. Each test was standard set using a modified Angoff method. Thematic analysis was conducted for item models with high and low levels of variance in facility (for student performance) and in average scores (for standard setting).
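The details of the modified Angoff procedure are not given in the abstract; as a rough illustration only, the sketch below shows the basic Angoff principle of averaging judges' estimates of borderline-candidate performance into a cut score. The function name and data layout are hypothetical, not taken from the study.

```python
from statistics import mean

def angoff_cut_score(ratings: list[list[float]]) -> float:
    """Simplified Angoff cut score.

    ratings[j][i] is judge j's estimate (0-1) of the probability that a
    borderline candidate answers item i correctly. The cut score is the
    mean estimate across all judges and items.
    """
    per_item = [mean(judge[i] for judge in ratings) for i in range(len(ratings[0]))]
    return mean(per_item)

# Illustrative ratings from 3 judges on 4 items (hypothetical numbers).
ratings = [
    [0.60, 0.55, 0.70, 0.50],
    [0.65, 0.50, 0.75, 0.55],
    [0.55, 0.60, 0.65, 0.60],
]
print(round(angoff_cut_score(ratings), 2))  # 0.6
```

A modified Angoff typically adds steps such as judge discussion or review of performance data before the final average is taken; the averaging step shown here is the common core.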
Two thousand two hundred eighteen students from 12 UK medical schools participated, with each school using one of the four papers. The average facility of the four papers ranged from 0.55 to 0.61, and the cut score ranged from 0.58 to 0.61. Twenty item models had a facility difference of > 0.15 between variants, and 10 item models had a difference in standard setting of > 0.1. Variation in parameters that could alter clinical reasoning strategies had the greatest impact on item facility.
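For reference, item facility is the proportion of students answering an item correctly. The sketch below, using hypothetical response data, shows how facilities of two variants of the same item model might be compared against the 0.15 threshold used above to flag items for thematic analysis.

```python
def facility(responses: list[bool]) -> float:
    """Item facility = proportion of students answering correctly."""
    return sum(responses) / len(responses)

# Hypothetical responses (correct/incorrect) to two variants of one item model.
variant_a = [True, True, False, True, False, True, True, False]    # facility 0.625
variant_b = [True, False, False, False, True, False, False, False] # facility 0.25

difference = abs(facility(variant_a) - facility(variant_b))
flagged = difference > 0.15  # threshold reported in the study
print(round(difference, 3), flagged)  # 0.375 True
```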
Item facility varied to a greater extent than the standard-set cut scores. This difference may reflect item variants causing greater disruption of clinical reasoning strategies in novice learners than in experts, but it is confounded by the possibility that the performance differences are explained at school level; it therefore warrants further study.