Wiberg Marie, Laukaityte Inga
Umeå University, Sweden.
Appl Psychol Meas. 2025 Mar 24:01466216251330305. doi: 10.1177/01466216251330305.
Test score equating is used to make scores from different test forms comparable, even when groups differ in ability. In practice, the non-equivalent groups with anchor test (NEAT) design is commonly used. The overall aim was to compare the amount of bias under different conditions when using either chained equating or frequency estimation with five different criterion functions: the identity function, linear equating, equipercentile equating, chained equating, and frequency estimation. We used real test data from a binary-scored multiple-choice college admissions test to illustrate that the choice of criterion function matters. Further, we simulated data in line with the empirical data to examine the effects of differences in ability between groups, differences in item difficulty, differences in length between the anchor test form and the regular test forms, differences in correlation between the anchor test form and the regular test forms, and different sample sizes. The results indicate that how bias is defined heavily affects the conclusions we draw about which equating method is to be preferred in different scenarios. Practical implications for standardized tests are given, together with recommendations on how to calculate bias when evaluating equating transformations.
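To illustrate why the definition of bias matters, the following is a minimal sketch, not the authors' procedure. It assumes bias at each score point is the mean estimated equated score across replications minus the value of a chosen criterion function. The function names, the replication values, and the form means and standard deviations are all hypothetical and chosen only for illustration.

```python
def identity_criterion(scores):
    # Criterion that assumes no equating is needed: e(x) = x.
    return [float(x) for x in scores]


def linear_criterion(scores, mean_x, sd_x, mean_y, sd_y):
    # Linear equating criterion: match the mean and SD of form X to form Y.
    return [mean_y + (sd_y / sd_x) * (x - mean_x) for x in scores]


def bias_per_score(replications, criterion):
    # Mean over replications of the estimated equated score,
    # minus the criterion value, at each score point.
    n_rep = len(replications)
    return [
        sum(rep[k] for rep in replications) / n_rep - criterion[k]
        for k in range(len(criterion))
    ]


# Hypothetical raw scores on form X.
scores = [0, 1, 2, 3, 4]

# Hypothetical estimated equated scores from three simulated replications.
replications = [
    [0.5, 1.5, 2.5, 3.5, 4.5],
    [0.4, 1.6, 2.4, 3.6, 4.4],
    [0.6, 1.4, 2.6, 3.4, 4.6],
]

# Same estimates, two different criterion functions (moments are assumed).
bias_identity = bias_per_score(replications, identity_criterion(scores))
bias_linear = bias_per_score(
    replications,
    linear_criterion(scores, mean_x=2.0, sd_x=1.0, mean_y=2.5, sd_y=1.0),
)

print([round(b, 3) for b in bias_identity])
print([round(b, 3) for b in bias_linear])
```

In this toy setup the same estimated equating function shows a bias of about half a score point against the identity criterion and essentially none against the linear criterion, which is the kind of dependence on the criterion function that the abstract describes.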