Tesei Gino, Giampanis Stefanos, Shi Jingpu, Norgeot Beau
Elevance Health, Palo Alto, CA 94301, USA.
J Biomed Inform. 2023 Apr;140:104339. doi: 10.1016/j.jbi.2023.104339. Epub 2023 Mar 20.
A causal effect can be defined as a comparison of outcomes resulting from two or more alternative actions, where only one of the action-outcome pairs is actually observed. In healthcare, the gold standard for measuring causal effects is the randomized controlled trial (RCT), in which a target population is explicitly defined and each study sample is randomly assigned to either the treatment or the control cohort. The great potential to derive actionable insights from causal relationships has led to a growing body of machine-learning research applying causal effect estimators to observational data in healthcare, education, and economics. The primary difference between causal effect studies on observational data and RCTs is that with observational data the study occurs after treatment, so there is no control over the treatment assignment mechanism. This can lead to large differences in the covariate distributions of the control and treatment samples, making comparisons of causal effects confounded and unreliable. Classical approaches have addressed this problem piecemeal, predicting treatment assignment and treatment effect separately. Recent work has extended some of these approaches to a new family of representation-learning algorithms, showing that the upper bound on the expected treatment effect estimation error is determined by two factors: the outcome generalization error of the representation and the distance between the treated and control distributions induced by the representation. To minimize this dissimilarity when learning such distributions, in this work we propose a specific auto-balancing, self-supervised objective. Experiments on real and benchmark datasets show that our approach consistently produces less biased estimates than previously published state-of-the-art methods.
We demonstrate that the reduction in error can be attributed directly to the ability to learn representations that explicitly reduce this dissimilarity; further, in cases where the positivity assumption is violated (frequent in observational data), we show that our approach performs significantly better than the previous state of the art. Thus, by learning representations that induce similar distributions for the treated and control cohorts, we present evidence supporting the error-bound dissimilarity hypothesis and provide a new state-of-the-art model for causal effect estimation.
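The two-factor error bound described above can be made concrete with a toy composite loss. The following sketch is purely illustrative and is not the authors' published objective: it combines an outcome prediction error with a simple linear-kernel proxy (squared distance between cohort means) for the distance between the treated and control representation distributions, weighted by a hypothetical trade-off parameter `alpha`.

```python
import numpy as np

def linear_mmd(phi_treated, phi_control):
    """Squared Euclidean distance between the mean representations of the
    treated and control cohorts -- a simple linear-kernel proxy for the
    distributional-distance term in the error bound."""
    diff = phi_treated.mean(axis=0) - phi_control.mean(axis=0)
    return float(np.sum(diff ** 2))

def balancing_objective(phi, y, y_hat, t, alpha=1.0):
    """Toy composite loss: outcome mean-squared error (the generalization
    term) plus alpha times the representation dissimilarity (the distance
    term). `phi` holds learned representations, `t` the binary treatment
    indicator, `y`/`y_hat` the true and predicted outcomes."""
    outcome_loss = float(np.mean((y - y_hat) ** 2))
    dissimilarity = linear_mmd(phi[t == 1], phi[t == 0])
    return outcome_loss + alpha * dissimilarity
```

In a representation-learning setting, this scalar would be minimized with respect to the encoder producing `phi`, driving the treated and control representation distributions together while preserving outcome predictiveness; the actual objective and distance measure used in the paper differ from this sketch.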