National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA, 19104-3102, USA.
American Institutes for Research, 1000 Thomas Jefferson Street, NW, Washington D.C., 20007, USA.
Psychometrika. 2019 Mar;84(1):147-163. doi: 10.1007/s11336-018-09652-3. Epub 2019 Jan 3.
This paper provides results on a form of adaptive testing that is used frequently in intelligence testing. In these tests, items are presented in order of increasing difficulty, and the presentation is adaptive in the sense that a session is discontinued once a test taker produces a certain number of incorrect responses in sequence; the subsequent (not observed) responses are commonly scored as wrong. The Stanford-Binet Intelligence Scales (SB5; Riverside Publishing Company, 2003), the Kaufman Assessment Battery for Children (KABC-II; Kaufman and Kaufman, 2004), the Kaufman Adolescent and Adult Intelligence Test (Kaufman and Kaufman, 2014), and the Universal Nonverbal Intelligence Test (2nd ed.; Bracken and McCallum, 2015) are among the many tests that use this rule. He and Wolfe (Educ Psychol Meas 72(5):808-826, 2012. https://doi.org/10.1177/0013164412441937) compared different ability estimation methods in a simulation study of this discontinue-rule adaptation of test length. However, there has been, to our knowledge, no study based on analytic arguments drawing on probability theory of the underlying distributional properties of what these authors call stochastic censoring of responses. The results obtained by He and Wolfe (2012) agree with those of De Ayala et al. (J Educ Meas 38:213-234, 2001), Rose et al. (Modeling non-ignorable missing data with item response theory (IRT; ETS RR-10-11), Educational Testing Service, Princeton, 2010), and Rose et al. (Psychometrika 82:795-819, 2017. https://doi.org/10.1007/s11336-016-9544-7) in that ability estimates are most biased when the not observed responses are scored as wrong. Because this scoring is used operationally, more research is needed to improve practice in this field.
The paper extends existing research on discontinue-rule adaptivity in intelligence tests in two ways. First, an analytical study of the distributional properties of discontinue-rule scored items is presented. Second, a simulation study is presented that includes additional scoring rules and ability estimators that may be suitable for reducing bias in discontinue-rule scored intelligence tests.
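The discontinue rule described in the abstract (stop a session after a fixed run of consecutive incorrect responses, then score the remaining, unobserved items as wrong) can be sketched in a few lines. This is only an illustrative simulation under an assumed Rasch (1PL) response model; the function name, parameter names, and the choice of model are assumptions for illustration and are not taken from the paper.

```python
import math
import random

def administer_discontinue_rule(theta, difficulties, stop_after, seed=None):
    """Simulate one session of a discontinue-rule scored test.

    Items (assumed sorted by increasing difficulty) are presented until
    the test taker produces `stop_after` incorrect responses in a row.
    The remaining, not observed items are then scored as wrong (0),
    mirroring the operational scoring discussed in the abstract.
    """
    rng = random.Random(seed)
    responses = []
    consecutive_wrong = 0
    for b in difficulties:
        # Assumed Rasch model: P(correct) = 1 / (1 + exp(-(theta - b)))
        p_correct = 1.0 / (1.0 + math.exp(-(theta - b)))
        x = 1 if rng.random() < p_correct else 0
        responses.append(x)
        consecutive_wrong = 0 if x == 1 else consecutive_wrong + 1
        if consecutive_wrong >= stop_after:
            break  # session is discontinued here
    # Stochastic censoring: unobserved items scored as wrong.
    responses.extend([0] * (len(difficulties) - len(responses)))
    return responses
```

Because the stopping point depends on the (random) response sequence, the number of items actually observed varies across test takers with the same ability, which is the stochastic censoring the authors analyze.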