Sherman Eli, Shpitser Ilya
Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218
Adv Neural Inf Process Syst. 2018 Dec;2018:9446-9457.
The assumption that data samples are independent and identically distributed (iid) is standard in many areas of statistics and machine learning. Nevertheless, in some settings, such as social networks, infectious disease modeling, and reasoning with spatial and temporal data, this assumption is false. An extensive literature exists on making causal inferences under the iid assumption [17, 11, 26, 21], even when unobserved confounding bias may be present. However, as pointed out in [19], causal inference in non-iid contexts is challenging due to the presence of both unobserved confounding and data dependence. In this paper we develop a general theory describing when causal inferences are possible in such scenarios. We use segregated graphs [20], a generalization of latent projection mixed graphs [28], to represent causal models of this type and provide a complete algorithm for nonparametric identification in these models. We then demonstrate how statistical inference may be performed on causal parameters identified by this algorithm. In particular, we consider cases where only a single sample is available for parts of the model due to full interference, i.e., all units are pathwise dependent and neighbors' treatments affect each other's outcomes [24]. We apply these techniques to a synthetic data set that models users sharing fake news articles given the structure of their social network, user activity levels, and baseline demographic and socioeconomic covariates.