Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, USA.
Department of Psychology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, USA.
Psychometrika. 2019 Mar;84(1):285-309. doi: 10.1007/s11336-018-9649-2. Epub 2019 Jan 22.
Differences across demographic groups in prediction systems that involve test scores remain a thorny and unresolved scientific, professional, and societal concern. Our case study uses a two-stage least squares (2SLS) estimator to jointly assess measurement invariance and prediction invariance in high-stakes testing. We therefore examined group differences based on latent rather than observed scores, using data for 176 colleges and universities from The College Board. Measurement invariance was rejected for the SAT mathematics (SAT-M) subtest at the 0.01 level for 74.5% and 29.9% of cohorts for Black versus White and Hispanic versus White comparisons, respectively. On average, Black students with the same standing on a common factor had observed SAT-M scores that were nearly a third of a standard deviation lower than those of comparable White students. We also found evidence that group differences in SAT-M measurement intercepts may partly explain the well-known finding of observed differences in prediction intercepts. Additionally, nearly a quarter of the statistically significant observed intercept differences were no longer statistically significant at the 0.05 level once predictor measurement error was accounted for using the 2SLS procedure. Our joint measurement and prediction invariance approach based on latent scores opens the door to a new high-stakes testing research agenda whose goal is not simply to assess whether observed group-based differences exist and what their size and direction are. Rather, the goal is to assess the causal chain, starting with the underlying theoretical mechanisms (e.g., contextual factors, differences in latent predictor scores) that affect the size and direction of any observed differences.
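To illustrate why accounting for predictor measurement error can change an observed intercept difference, the following is a minimal simulated sketch, not the authors' procedure or data. It assumes a single latent predictor measured by two error-laden indicators (`x1`, `x2`), a focal group with a lower latent mean, and no true group difference in the prediction intercept. All variable names, sample sizes, and parameter values are hypothetical. Ordinary least squares on `x1` produces a spurious group intercept difference, while 2SLS using `x2` as an instrument for `x1` largely removes it.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y regressed on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def two_stage_least_squares(y, x_endog, X_exog, Z):
    """2SLS with one error-laden regressor.

    X_exog holds the regressors treated as error-free (constant, group dummy);
    Z holds the instruments for x_endog.
    Returns coefficients ordered as [X_exog columns..., x_endog].
    """
    W = np.column_stack([X_exog, Z])           # first-stage design matrix
    x_hat = W @ ols(W, x_endog)                # stage 1: fitted values of x_endog
    X_second = np.column_stack([X_exog, x_hat])
    return ols(X_second, y)                    # stage 2: regress y on fitted values


# Hypothetical data-generating values, chosen only for illustration.
rng = np.random.default_rng(0)
n = 20_000
group = rng.integers(0, 2, n)                  # 0 = reference group, 1 = focal group
xi = rng.normal(0.0, 1.0, n) - 0.5 * group     # latent predictor, lower mean in focal group
x1 = xi + rng.normal(0.0, 0.7, n)              # observed indicator 1 (with error)
x2 = xi + rng.normal(0.0, 0.7, n)              # observed indicator 2 (with error)
y = 1.0 + 0.8 * xi + rng.normal(0.0, 0.5, n)   # outcome: no true group intercept gap

X_exog = np.column_stack([np.ones(n), group])

# Coefficient order in both fits: [intercept, group difference, slope].
b_ols = ols(np.column_stack([X_exog, x1]), y)
b_2sls = two_stage_least_squares(y, x1, X_exog, x2)

print("OLS  [intercept, group diff, slope]:", np.round(b_ols, 3))
print("2SLS [intercept, group diff, slope]:", np.round(b_2sls, 3))
```

In this sketch the OLS slope is attenuated toward zero and the group coefficient is negative even though the simulated outcome has no group intercept difference. The 2SLS estimates should sit close to the generating values (slope 0.8, group difference 0). Standard errors are omitted here; a second-stage regression fitted this way would not report valid 2SLS standard errors without a correction.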