Wang Dongwei, Keller Lisa A
University of Massachusetts Amherst, USA.
Educ Psychol Meas. 2024 Sep 24:00131644241278925. doi: 10.1177/00131644241278925.
In educational assessment, cut scores are often defined through standard setting by a group of subject matter experts. This study investigates the impact of several factors on classification accuracy using receiver operating characteristic (ROC) analysis, to provide statistical and theoretical evidence when the cut score needs to be refined. Factors examined in the study include the sample distribution relative to the cut score, the prevalence of the positive event, and the cost ratio. Forty item responses were simulated for examinees from four sample distributions. In addition, the prevalence and the cost ratio between false negatives and false positives were manipulated to examine their impact on classification accuracy. The optimal cut score was identified using the Youden index. The results showed that the evaluation criterion tended to pull the optimal cut score closer to the mode of the proficiency distribution. In addition, the optimal cut score shifted with the prevalence of the positive event and the cost ratio. With the item parameters used to simulate the data and the simulated sample distributions, it was found that when passing the exam is a low-prevalence event in the population, increasing the cut score operationally improves the classification; when passing the exam is a high-prevalence event, the cut score should be reduced to achieve optimality. As the cost ratio increases, the optimal cut score suggested by the evaluation criterion decreases. In three of the four sample distributions examined in this study, increasing the cut score enhanced the classification irrespective of the cost ratio when the prevalence in the population is 50%. This study provides statistical evidence when the cut score needs to be refined for policy reasons.
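The selection of an optimal cut score can be illustrated with a minimal sketch. The Youden index is the standard quantity J = sensitivity + specificity - 1. The abstract does not state how prevalence and the cost ratio enter the criterion, so the expected-cost formula below is an assumption made for illustration, not the authors' method. The scores, mastery labels, and function names are hypothetical toy values, not data from the study.

```python
from typing import List, Sequence, Tuple

# Hypothetical toy data: total scores on a 40-item test and true mastery status.
scores: List[int] = [18, 19, 20, 22, 23, 25, 24, 26, 27, 28, 30]
is_master: List[bool] = [False] * 6 + [True] * 5
cuts: List[int] = list(range(20, 31))  # candidate cut scores


def rates(scores: Sequence[int], is_master: Sequence[bool], cut: int) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when score >= cut is classified as passing."""
    pairs = list(zip(scores, is_master))
    tp = sum(1 for s, m in pairs if m and s >= cut)
    fn = sum(1 for s, m in pairs if m and s < cut)
    tn = sum(1 for s, m in pairs if not m and s < cut)
    fp = sum(1 for s, m in pairs if not m and s >= cut)
    return tp / (tp + fn), tn / (tn + fp)


def youden_cut(scores: Sequence[int], is_master: Sequence[bool], cuts: Sequence[int]) -> int:
    """Cut score maximizing the Youden index J = sensitivity + specificity - 1."""
    return max(cuts, key=lambda c: sum(rates(scores, is_master, c)) - 1.0)


def weighted_cut(scores: Sequence[int], is_master: Sequence[bool], cuts: Sequence[int],
                 prevalence: float, cost_ratio: float) -> int:
    """Cut score minimizing an assumed expected misclassification cost.

    Assumption: cost = cost_ratio * prevalence * FNR + (1 - prevalence) * FPR,
    where cost_ratio is the cost of a false negative relative to a false positive.
    """
    def expected_cost(c: int) -> float:
        sens, spec = rates(scores, is_master, c)
        return cost_ratio * prevalence * (1.0 - sens) + (1.0 - prevalence) * (1.0 - spec)

    return min(cuts, key=expected_cost)


print(youden_cut(scores, is_master, cuts))               # 24
print(weighted_cut(scores, is_master, cuts, 0.1, 1.0))   # 26
print(weighted_cut(scores, is_master, cuts, 0.1, 10.0))  # 24
```

On this toy data, lowering the assumed prevalence to 10% moves the selected cut score up from 24 to 26, and raising the cost ratio to 10 moves it back down to 24. This mirrors the direction of the effects described in the abstract, under the stated assumptions only.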