Niklas Schulte, Heinz Holling, Paul-Christian Bürkner
University of Münster, Germany.
Aalto University, Espoo, Finland.
Educ Psychol Meas. 2021 Apr;81(2):262-289. doi: 10.1177/0013164420934861. Epub 2020 Jul 24.
Forced-choice questionnaires can prevent faking and other response biases typically associated with rating scales. However, the derived trait scores are often unreliable and ipsative, which makes interindividual comparisons in high-stakes situations impossible. Several studies suggest that these problems vanish if the number of measured traits is high. To determine the necessary number of traits under varying sample sizes, factor loadings, and intertrait correlations, simulations were performed for the two most widely used scoring methods: the classical (ipsative) approach and Thurstonian item response theory (IRT) models. The results demonstrate that, although Thurstonian IRT models in particular perform well under ideal conditions, both methods yield insufficient reliabilities in most conditions resembling applied contexts. Moreover, not only the classical estimates but also the Thurstonian IRT estimates for questionnaires with equally keyed items remain (partially) ipsative, even when the number of traits is very high (i.e., 30). This result not only questions earlier assumptions regarding the use of classical scores in high-dimensional questionnaires but also raises doubts about many validation studies on Thurstonian IRT models, because correlations of (partially) ipsative scores with external criteria cannot be interpreted in the usual way.
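To make the term "ipsative" concrete, the following is a minimal illustrative sketch of classical forced-choice scoring. It is not the authors' simulation design. The block layout (one item per trait in every block), the uncorrelated traits, and the function name `classical_fc_scores` with all of its parameters are assumptions chosen for brevity. The sketch shows that rank-based classical scores sum to the same constant for every respondent, so that the entries of their covariance matrix sum to zero regardless of the true trait correlations.

```python
import numpy as np

def classical_fc_scores(n_persons=200, n_traits=5, n_blocks=10,
                        loading=0.7, seed=0):
    """Simulate ranked forced-choice blocks and score them classically.

    Hypothetical setup: every block holds one positively keyed item per
    trait, and the true traits are uncorrelated standard normals.
    """
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((n_persons, n_traits))  # true trait scores
    noise_sd = np.sqrt(1.0 - loading ** 2)               # unit-variance utilities
    scores = np.zeros((n_persons, n_traits))
    for _ in range(n_blocks):
        # Latent utility of each item in the block for each respondent.
        noise = rng.standard_normal((n_persons, n_traits))
        utility = loading * theta + noise_sd * noise
        # Within-person ranks: 0 = least preferred, n_traits - 1 = most preferred.
        ranks = utility.argsort(axis=1).argsort(axis=1)
        scores += ranks
    return theta, scores

theta, scores = classical_fc_scores()

# Every respondent receives the same total score: the hallmark of ipsativity.
totals = scores.sum(axis=1)

# Because the totals are constant, the covariance matrix of the scores
# sums to (numerically) zero even though the true traits are uncorrelated.
cov_sum = np.cov(scores.T).sum()
```

Under these assumptions each block contributes ranks 0 through 4, so each respondent's total is fixed by the questionnaire design rather than by the respondent's trait levels. This is why such scores support comparisons within a person but not between persons.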