Li Chunxiao, Rudin Cynthia, McCormick Tyler H
Department of Statistical Science, Duke University, Durham, NC 27708, USA.
Departments of Computer Science, Electrical and Computer Engineering, Statistical Science, Mathematics and Biostatistics & Bioinformatics, Duke University, Durham, NC 27708, USA.
J Mach Learn Res. 2022;23.
Instrumental variables (IV) are widely used in the social and health sciences in situations where a researcher would like to measure a causal effect but cannot perform an experiment. For valid causal inference in an IV model, there must be external (exogenous) variation that (i) has a sufficiently large impact on the variable of interest (called the relevance assumption) and where (ii) the only pathway through which the external variation impacts the outcome is via the variable of interest (called the exclusion restriction). For statistical inference, researchers must also make assumptions about the functional form of the relationship between the three variables. Current practice assumes (i) and (ii) are met, then postulates a functional form with limited input from the data. In this paper, we describe a framework that leverages machine learning to validate these typically unchecked but consequential assumptions in the IV framework, providing the researcher empirical evidence about the quality of the instrument given the data at hand. Central to the proposed approach is the idea of prediction validity. Prediction validity checks that error terms, which should be independent of the instrument, cannot be modeled with machine learning any better than a model that is identically zero. We use prediction validity to develop both one-stage and two-stage approaches for IV, and demonstrate their performance on an example relevant to climate change policy.
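The prediction validity idea can be illustrated with a minimal sketch. It is not the paper's procedure: it assumes a linear IV estimator with a single instrument, simulated data, and a cubic polynomial regression standing in for the machine learning model. All function names (`simulate`, `linear_iv_residuals`, `prediction_validity_gap`) are hypothetical. The check compares the held-out error of a model that predicts the IV residuals from the instrument against the error of the identically zero model.

```python
import numpy as np

def simulate(n=4000, direct_effect=0.0, seed=1):
    """Toy data (an assumption for illustration): z is the instrument,
    u an unobserved confounder. A nonzero direct_effect lets z reach y
    outside of x, violating the exclusion restriction."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    u = rng.normal(size=n)
    x = z + u + 0.5 * rng.normal(size=n)
    y = 2.0 * x + u + direct_effect * z**2 + 0.5 * rng.normal(size=n)
    return z, x, y

def linear_iv_residuals(z, x, y):
    """Linear IV estimate with one instrument: beta = cov(z, y) / cov(z, x)."""
    zc, xc, yc = z - z.mean(), x - x.mean(), y - y.mean()
    beta = (zc @ yc) / (zc @ xc)
    return beta, yc - beta * xc

def poly_features(z, degree):
    # Columns 1, z, z^2, ..., z^degree
    return np.vander(z, degree + 1, increasing=True)

def prediction_validity_gap(z, resid, degree=3, train_frac=0.5, seed=0):
    """Held-out MSE of the zero model minus held-out MSE of a model
    fit to predict the residuals from z. A gap near or below zero means
    the residuals are no more predictable from z than by zero."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(z))
    n_train = int(train_frac * len(z))
    tr, te = idx[:n_train], idx[n_train:]
    coef, *_ = np.linalg.lstsq(poly_features(z[tr], degree), resid[tr], rcond=None)
    pred = poly_features(z[te], degree) @ coef
    mse_model = np.mean((resid[te] - pred) ** 2)
    mse_zero = np.mean(resid[te] ** 2)
    return mse_zero - mse_model

for effect in (0.0, 1.0):
    z, x, y = simulate(direct_effect=effect)
    beta, resid = linear_iv_residuals(z, x, y)
    gap = prediction_validity_gap(z, resid)
    print(f"direct_effect={effect}: beta_hat={beta:.3f}, gap={gap:.3f}")
```

In this toy setup the direct path runs through `z**2`, which is uncorrelated with `z`, so the linear IV estimate stays close to the true coefficient while the residuals become predictable from the instrument. That is the kind of violation a residual-prediction check is meant to surface; a flexible learner would replace the polynomial fit in practice.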