Kern Holger L, Stuart Elizabeth A, Hill Jennifer, Green Donald P
Department of Political Science, Florida State University.
Departments of Mental Health, Biostatistics, and Health Policy and Management, Bloomberg School of Public Health, Johns Hopkins University.
J Res Educ Eff. 2016;9(1):103-127. doi: 10.1080/19345747.2015.1060282. Epub 2016 Jan 14.
Randomized experiments are considered the gold standard for causal inference, as they can provide unbiased estimates of treatment effects for the experimental participants. However, researchers and policymakers are often interested in using a specific experiment to inform decisions about other target populations. In education research, increasing attention is being paid to the potential lack of generalizability of randomized experiments, as the experimental participants may be unrepresentative of the target population of interest. This paper examines whether generalization may be assisted by statistical methods that adjust for observed differences between the experimental participants and members of a target population. The methods examined include approaches that reweight the experimental data so that participants more closely resemble the target population and methods that utilize models of the outcome. Two simulation studies and one empirical analysis investigate and compare the methods' performance. One simulation uses purely simulated data while the other utilizes data from an evaluation of a school-based dropout prevention program. Our simulations suggest that machine learning methods outperform regression-based methods when the required structural (ignorability) assumptions are satisfied. When these assumptions are violated, all of the methods examined perform poorly. Our empirical analysis uses data from a multi-site experiment to assess how well results from a given site predict impacts in other sites. Using a variety of extrapolation methods, predicted effects for each site are compared to actual benchmarks. Flexible modeling approaches perform best, although linear regression is not far behind. Taken together, these results suggest that flexible modeling techniques can aid generalization while underscoring the fact that even state-of-the-art statistical techniques still rely on strong assumptions.
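The reweighting idea described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's actual analysis: it uses synthetic data with a single hypothetical covariate `x`, fits a logistic model for the probability of being in the experimental sample, and weights each participant by the odds of not being sampled so that the weighted sample resembles the target population.

```python
import numpy as np

# Illustrative sketch (not the paper's analysis): reweight an experimental
# sample so it resembles a target population, using odds-based weights from
# a logistic model of sample membership. All data here are synthetic.
rng = np.random.default_rng(0)

n_exp, n_pop = 500, 2000
x_exp = rng.normal(0.5, 1.0, n_exp)  # experimental participants (shifted)
x_pop = rng.normal(0.0, 1.0, n_pop)  # draws from the target population

# Fit P(in experiment | x) by plain gradient descent on the logistic loss,
# so the example needs no external ML library.
x = np.concatenate([x_exp, x_pop])
s = np.concatenate([np.ones(n_exp), np.zeros(n_pop)])  # sample indicator
X = np.column_stack([np.ones_like(x), x])

beta = np.zeros(2)
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    beta -= 0.5 * X.T @ (p - s) / len(s)

# Weight each participant by the odds of NOT being sampled given x:
# w_i = (1 - e(x_i)) / e(x_i), where e is the estimated sampling propensity.
e = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x_exp)))
w = (1.0 - e) / e

unweighted = x_exp.mean()                  # near the sample mean of 0.5
reweighted = np.average(x_exp, weights=w)  # pulled toward the population mean
print(f"unweighted mean: {unweighted:.2f}, reweighted mean: {reweighted:.2f}")
```

The same odds-based weights would then enter a weighted treatment-effect estimate; the machine-learning variants the paper examines replace the simple logistic model with more flexible estimators of the sampling propensity or of the outcome itself.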