Zhang Xiuyuan, McDermott Paul A, Fantuzzo John W, Gadsden Vivian L
University of Pennsylvania, USA.
Psychol Rep. 2013 Aug;113(1):1303-25. doi: 10.2466/03.10.pr0.113x11z6.
A multiscale criterion-referenced test featuring two presumably equivalent forms (A and B) was administered to 1,667 Head Start children at each of four points over an academic year. Using a randomly equivalent groups design, three equating methods were applied: common-item IRT equating using concurrent calibration, linear transformation, and equipercentile transformation. The methods were compared by examining mean score differences, weighted mean squared differences, and Kolmogorov's D statistics for each subscale. The results indicated that, over time, the IRT equating method and the conventional equating methods exhibited different patterns of discrepancy between the two test forms. IRT equating yielded marginally smaller form-to-form mean score differences and slightly fewer distributional discrepancies between Forms A and B than either linear or equipercentile equating. However, the results were mixed, indicating that more studies are needed to clarify the relative merits and weaknesses of each approach.
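As a rough illustration of the two conventional methods and the distributional comparison named in the abstract, the following is a minimal sketch and not the authors' procedure. It assumes raw scores are available as plain lists, uses an unsmoothed mid-percentile rank for equipercentile equating, and computes Kolmogorov's D as the largest gap between two empirical distribution functions. All function names and the toy scores are hypothetical; IRT equating and the weighted mean squared difference are not shown.

```python
from bisect import bisect_left, bisect_right
from statistics import mean, pstdev


def linear_equate(x, form_a, form_b):
    """Map a Form A score onto the Form B scale by matching mean and SD."""
    slope = pstdev(form_b) / pstdev(form_a)
    return mean(form_b) + slope * (x - mean(form_a))


def percentile_rank(x, scores):
    """Mid-percentile rank of x within scores, as a proportion in [0, 1]."""
    s = sorted(scores)
    below = bisect_left(s, x)          # scores strictly below x
    at = bisect_right(s, x) - below    # scores equal to x
    return (below + 0.5 * at) / len(s)


def equipercentile_equate(x, form_a, form_b):
    """Map a Form A score to the Form B score at the same percentile rank.

    Simplified: no presmoothing, linear interpolation between order statistics.
    """
    p = percentile_rank(x, form_a)
    s = sorted(form_b)
    pos = p * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def kolmogorov_d(sample_1, sample_2):
    """Largest absolute gap between the two empirical distribution functions."""
    s1, s2 = sorted(sample_1), sorted(sample_2)
    return max(
        abs(bisect_right(s1, v) / len(s1) - bisect_right(s2, v) / len(s2))
        for v in set(s1) | set(s2)
    )


# Toy scores invented for illustration only (not data from the study).
form_a = [10, 12, 14, 16, 18]
form_b = [12, 15, 18, 21, 24]

equated = [equipercentile_equate(x, form_a, form_b) for x in form_a]
print("mean difference after equating:", mean(equated) - mean(form_b))
print("Kolmogorov D after equating:", kolmogorov_d(equated, form_b))
```

A smaller mean difference and a smaller D after equating both indicate closer agreement between the forms, which is the sense in which the abstract compares the methods.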