Prosperi Mattia, Guo Yi, Bian Jiang
Data Intelligence Systems Lab, Department of Epidemiology, College of Public Health and Health Professions & College of Medicine, University of Florida, FL, USA.
Cancer Informatics Shared Resource, University of Florida Health Cancer Center, FL, USA; Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, FL, USA.
J Biomed Inform. 2021 Mar;115:103689. doi: 10.1016/j.jbi.2021.103689. Epub 2021 Feb 4.
Learning causal effects from observational data, e.g. estimating the effect of a treatment on survival by data-mining electronic health records (EHRs), can be biased due to unmeasured confounders, mediators, and colliders. When the causal dependencies among features/covariates are expressed as a directed acyclic graph, do-calculus makes it possible, under certain assumptions, to identify one or more adjustment sets that eliminate the bias for a given causal query. However, prior knowledge of the causal structure might be only partial; algorithms for causal structure discovery often provide ambiguous solutions, and their computational complexity becomes practically intractable when the feature sets grow large. We hypothesize that the true causal effect of a query variable on an outcome can be approximated by an ensemble of lower-complexity estimators, namely bagged random causal networks. A bagged random causal network is an ensemble of subnetworks constructed by sampling the feature subspaces (with the query, the outcome, and a random number of other features), drawing conditional dependencies among the features, and inferring the corresponding adjustment sets. The causal effect can then be estimated by any regression function of the outcome on the query together with the adjustment sets. Through simulations and a real-world clinical dataset (class III malocclusion data), we show that the bagged estimator is, in most cases, consistent with the true causal effect if the structure is known, has a good variance/bias trade-off when the structure is unknown (estimated using heuristics), has lower computational complexity than learning a full network, and outperforms boosted regression. In conclusion, the bagged random causal network is well-suited to estimate query-target causal effects from observational studies on EHR and other high-dimensional biomedical databases.
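The ensemble construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it is a toy version under stated assumptions: the causal structure is taken as known, the adjustment set in each bag is simply the sampled features that are parents of the query (a backdoor-style choice), rows are bootstrapped, and ordinary least squares stands in for "any regression function". The simulated data, the variable names (`Z`, `X`, `Y`, `W0`..`W3`), and the helper functions are all hypothetical.

```python
import numpy as np

# Assumed known DAG for the toy data: Z -> X, Z -> Y, X -> Y (W0..W3 are pure noise).
PARENTS = {"X": {"Z"}, "Y": {"X", "Z"}}
TRUE_EFFECT = 2.0  # coefficient of X in the structural equation for Y


def simulate(n, rng):
    """Simulate one confounder Z, query X, outcome Y, and four noise features."""
    z = rng.normal(size=n)
    x = z + rng.normal(size=n)
    y = TRUE_EFFECT * x + 3.0 * z + rng.normal(size=n)
    data = {"Z": z, "X": x, "Y": y}
    for j in range(4):
        data[f"W{j}"] = rng.normal(size=n)
    return data


def ols_effect(data, query, outcome, adjust, rows):
    """OLS coefficient of the query when regressing the outcome on query + adjustment set."""
    cols = [np.ones(len(rows)), data[query][rows]]
    cols += [data[name][rows] for name in adjust]
    design = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design, data[outcome][rows], rcond=None)
    return float(coef[1])


def bagged_effect(data, query, outcome, parents, n_bags, rng):
    """Average the per-bag effect estimates over random feature subspaces."""
    others = sorted(k for k in data if k not in (query, outcome))
    n = len(data[query])
    estimates = []
    for _ in range(n_bags):
        # Sample a random number of other features (query and outcome are always kept).
        k = int(rng.integers(1, len(others) + 1))
        idx = rng.choice(len(others), size=k, replace=False)
        subset = [others[i] for i in idx]
        # Assumption: adjust for sampled features that are parents of the query.
        adjust = [f for f in subset if f in parents.get(query, set())]
        rows = rng.integers(0, n, size=n)  # bootstrap sample of rows
        estimates.append(ols_effect(data, query, outcome, adjust, rows))
    return float(np.mean(estimates))


rng = np.random.default_rng(0)
data = simulate(5000, rng)
all_rows = np.arange(5000)
naive = ols_effect(data, "X", "Y", [], all_rows)        # no adjustment
adjusted = ols_effect(data, "X", "Y", ["Z"], all_rows)  # full known adjustment set
bagged = bagged_effect(data, "X", "Y", PARENTS, n_bags=200, rng=rng)
print(f"naive={naive:.2f} adjusted={adjusted:.2f} bagged={bagged:.2f}")
```

In this toy setup, bags that happen to omit the confounder `Z` remain biased, so the averaged estimate falls between the unadjusted and the fully adjusted values. How subspaces are sampled and how adjustment sets are inferred in the published method may differ from this sketch.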