Nethery Rachel C, Mealli Fabrizia, Dominici Francesca
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.
Department of Statistics, Informatics, Applications, University of Florence, Florence, Italy.
Ann Appl Stat. 2019 Jun;13(2):1242-1267. doi: 10.1214/18-AOAS1231. Epub 2019 Jun 17.
Most causal inference studies rely on the assumption of overlap to estimate population or sample average causal effects. When data suffer from non-overlap, estimation of these estimands requires reliance on model specifications, due to poor data support. All existing methods to address non-overlap, such as trimming or down-weighting data in regions of poor data support, change the estimand so that inference cannot be made on the sample or the underlying population. In environmental health research settings, where study results are often intended to influence policy, population-level inference may be critical, and changes in the estimand can diminish the impact of the study results, because estimates may not be representative of effects in the population of interest to policymakers. Researchers may be willing to make additional, minimal modeling assumptions in order to preserve the ability to estimate population average causal effects. We seek to make two contributions on this topic. First, we propose a flexible, data-driven definition of propensity score overlap and non-overlap regions. Second, we develop a novel Bayesian framework to estimate population average causal effects with minor model dependence and appropriately large uncertainties in the presence of non-overlap and causal effect heterogeneity. In this approach, the tasks of estimating causal effects in the overlap and non-overlap regions are delegated to two distinct models, suited to the degree of data support in each region. Tree ensembles are used to non-parametrically estimate individual causal effects in the overlap region, where the data can speak for themselves. In the non-overlap region, where insufficient data support means reliance on model specification is necessary, individual causal effects are estimated by extrapolating trends from the overlap region via a spline model. The promising performance of our method is demonstrated in simulations. 
Finally, we use our method to perform a novel investigation of the causal effect of natural gas compressor station exposure on cancer outcomes. Code and data to implement the method and reproduce all simulations and analyses are available on GitHub (https://github.com/rachelnethery/overlap).
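To make the idea of a data-driven propensity score overlap region concrete, the following is a minimal sketch, not the authors' definition or code. It assumes a simple neighborhood rule: a unit is placed in the overlap region when at least `min_count` treated units and at least `min_count` control units (the unit itself included) have estimated propensity scores within `window / 2` of its own. The function name `overlap_flags`, the parameters `window` and `min_count`, and the toy data are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_left, bisect_right


def overlap_flags(ps, treated, window=0.2, min_count=2):
    """Return one boolean per unit: True if the unit is in the overlap region.

    Assumed rule (illustrative only): a unit is in the overlap region when
    both treatment groups have at least `min_count` units whose propensity
    scores lie within window / 2 of the unit's own score.
    """
    # Sort the propensity scores of each group once so counts are fast.
    treated_ps = sorted(p for p, z in zip(ps, treated) if z == 1)
    control_ps = sorted(p for p, z in zip(ps, treated) if z == 0)
    half = window / 2

    def count_near(sorted_ps, p):
        # Number of scores in the closed interval [p - half, p + half].
        lo = bisect_left(sorted_ps, p - half)
        hi = bisect_right(sorted_ps, p + half)
        return hi - lo

    return [
        count_near(treated_ps, p) >= min_count
        and count_near(control_ps, p) >= min_count
        for p in ps
    ]


# Toy data: only controls at low scores, only treated units at high scores,
# and a mix of both groups in the middle.
ps = [0.10, 0.12, 0.15, 0.48, 0.50, 0.52, 0.54, 0.90, 0.92, 0.95]
treated = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

flags = overlap_flags(ps, treated, window=0.2, min_count=2)
overlap_units = [i for i, f in enumerate(flags) if f]
non_overlap_units = [i for i, f in enumerate(flags) if not f]
```

In a two-model approach like the one described in the abstract, units in `overlap_units` would be handled by the flexible non-parametric model, and units in `non_overlap_units` by the model that extrapolates from the overlap region. The choice of `window` and `min_count` controls how much local data support is required before a unit counts as being in the overlap region.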