Joshua C. Chang, Julia Porcino, Elizabeth K. Rasch, Larry Tang
Rehabilitation Medicine Department, NIH Clinical Center, Bethesda, Maryland, United States of America.
National Center for Forensic Science, University of Central Florida, Orlando, Florida, United States of America.
PLoS One. 2022 Apr 8;17(4):e0266350. doi: 10.1371/journal.pone.0266350. eCollection 2022.
Item response theory (IRT) is the statistical paradigm underlying a dominant family of generative probabilistic models for test responses, used to quantify traits in individuals relative to target populations. The graded response model (GRM) is a particular IRT model that is used for ordered polytomous test responses. Both the development and the application of the GRM and other IRT models require statistical decisions. For formulating these models (calibration), one needs to decide on methodologies for item selection, inference, and regularization. For applying these models (test scoring), one needs to make similar decisions, often prioritizing computational tractability and/or interpretability. In many applications, such as in the Work Disability Functional Assessment Battery (WD-FAB), tractability implies approximating an individual's score distribution using estimates of mean and variance, and obtaining that score conditional on only point estimates of the calibrated model. In this manuscript, we evaluate the calibration and scoring of models under this common use-case using Bayesian cross-validation. Applied to the WD-FAB responses collected for the National Institutes of Health, we assess the predictive power of implementations of the GRM based on their ability to yield, on validation sets of respondents, ability estimates that are most predictive of patterns of item responses. Our main finding indicates that regularized Bayesian calibration of the GRM outperforms the regularization-free empirical Bayesian procedure of marginal maximum likelihood. We also motivate the use of compactly supported priors in test scoring.
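As background for the abstract's description, the graded response model assigns each ordered response category a probability given a respondent's latent ability, an item discrimination, and ordered category thresholds: the cumulative probability of responding at or above category k is a logistic function of ability, and per-category probabilities are differences of adjacent cumulative terms. The sketch below illustrates this standard GRM form; the parameter names (`theta`, `a`, `b`) are generic IRT notation, not values or identifiers taken from the WD-FAB calibration described in the paper.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Graded response model (Samejima): per-category response probabilities.

    theta : float, latent ability of the respondent
    a     : float, item discrimination
    b     : ordered category thresholds, length K-1 for K response categories
    Returns an array of K probabilities summing to 1.
    (Illustrative sketch; parameters here are generic, not from the paper.)
    """
    b = np.asarray(b, dtype=float)
    # Cumulative probability of responding in category k or higher, k = 1..K-1
    cum = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # shape (K-1,)
    upper = np.concatenate(([1.0], cum))           # P(Y >= 0) is always 1
    lower = np.concatenate((cum, [0.0]))           # P(Y >= K) is always 0
    return upper - lower                           # P(Y = k) for k = 0..K-1

# Example: a 4-category item; probabilities are nonnegative and sum to 1
probs = grm_category_probs(theta=0.5, a=1.2, b=[-1.0, 0.0, 1.5])
print(probs, probs.sum())
```

Calibration estimates `a` and `b` for each item from response data; scoring inverts the model to estimate `theta` for a respondent, which is where the mean/variance approximation and point-estimate conditioning discussed in the abstract come in.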