Kang Hyeon-Ah, Han Suhwa, Kim Doyoung, Kao Shu-Chuan
University of Texas at Austin, Austin, TX, USA.
National Council of State Boards of Nursing, Chicago, IL, USA.
Educ Psychol Meas. 2022 Aug;82(4):811-838. doi: 10.1177/00131644211032261. Epub 2021 Aug 2.
The development of technology-enhanced innovative items calls for practical models that can describe polytomous testlet items. In this study, we evaluate four measurement models that can characterize polytomous items administered in testlets: (a) the generalized partial credit model (GPCM), (b) the testlet-as-a-polytomous-item model (TPIM), (c) the random-effect testlet model (RTM), and (d) the fixed-effect testlet model (FTM). Using data generated from the GPCM, FTM, and RTM, we examine the performance of the scoring models in multiple aspects: relative model fit, absolute item fit, significance of testlet effects, parameter recovery, and classification accuracy. The empirical analysis suggests that the relative performance of the models varies substantially depending on the testlet-effect type, effect size, and trait estimator. When testlets had no or fixed effects, the GPCM and FTM led to the most desirable measurement outcomes. When testlets had random interaction effects, the RTM demonstrated the best model fit yet showed substantially different performance in trait recovery depending on the estimator. In particular, the advantage of the RTM as a scoring model was discernible only when strong random effects were present and trait levels were estimated with Bayes priors. In other settings, the simpler models (i.e., GPCM, FTM) performed better or comparably. The study also revealed that polytomous scoring of testlet items has limited prospects as a functional scoring method. Based on the outcomes of the empirical evaluation, we provide practical guidelines for choosing a measurement model for polytomous innovative items that are administered in testlets.
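To make the first of the compared models concrete, the sketch below computes category response probabilities under the generalized partial credit model (GPCM). The function name, parameterization (one discrimination `a` and a vector of step difficulties `b`), and example values are illustrative assumptions, not code or parameters from the study.

```python
import numpy as np

def gpcm_probs(theta, a, b):
    """Category probabilities under the GPCM: P(X = k | theta) is
    proportional to exp(sum_{v=1}^{k} a * (theta - b_v)), where the
    empty sum for k = 0 is defined as 0.

    theta : latent trait level
    a     : item discrimination
    b     : step difficulties (length = number of categories - 1)
    """
    b = np.asarray(b, dtype=float)
    steps = a * (theta - b)                      # step-level terms
    cum = np.concatenate(([0.0], np.cumsum(steps)))  # cumulative sums, k = 0..m
    num = np.exp(cum - cum.max())                # subtract max for stability
    return num / num.sum()

# Example: a 4-category polytomous item (3 step difficulties)
p = gpcm_probs(theta=0.5, a=1.2, b=[-1.0, 0.0, 1.0])
```

A random-effect testlet model extends this by replacing `theta` with `theta + gamma`, where `gamma` is a person-specific testlet effect shared by items in the same testlet; a fixed-effect version instead shifts the item parameters by a constant per testlet.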