Tong Jiayi, Hu Jie, Hripcsak George, Ning Yang, Chen Yong
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA.
Department of Biomedical Informatics, Columbia University, New York, NY 10027, USA.
J Mach Learn Res. 2025;26.
High-dimensional healthcare data, such as electronic health records (EHR) and claims data, present two primary challenges: the large number of variables and the need to consolidate data from multiple clinical sites. A third key challenge is potential heterogeneity across sites in the form of covariate shift. In this paper, we propose DisCo-HD, a distributed learning algorithm that accounts for covariate shift when estimating the average treatment effect (ATE) from high-dimensional data. Leveraging the surrogate likelihood method, our method calibrates the estimates of the propensity score and outcome models to approximately attain the desired covariate balancing property, while accounting for covariate shift across multiple clinical sites. We show that our distributed covariate balancing propensity score estimator can approximate the pooled estimator, which is obtained by pooling the data from all sites. The proposed estimator remains consistent if either the propensity score model or the outcome regression model is correctly specified, and it achieves the semiparametric efficiency bound when both models are correctly specified. We conduct simulation studies to demonstrate the performance of the proposed algorithm, and we apply it to a real-world data set to illustrate its readiness for implementation and its validity.
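The double robustness property stated in the abstract (consistency when either the propensity score model or the outcome model is correct) can be illustrated with a small simulation. The sketch below is not DisCo-HD: it uses the standard augmented inverse probability weighting (AIPW) estimator on a single simulated site, with no distributed estimation, no high-dimensional covariates, and no covariate shift. The function name `aipw_ate`, the data-generating process, and the deliberately misspecified models are all assumptions chosen for illustration.

```python
import numpy as np


def aipw_ate(y, t, ps, mu1, mu0):
    """Standard AIPW (doubly robust) estimate of the ATE.

    y   : observed outcomes
    t   : binary treatment indicator (0/1)
    ps  : propensity scores P(T=1 | X) from some working model
    mu1 : outcome-model predictions E[Y | T=1, X]
    mu0 : outcome-model predictions E[Y | T=0, X]
    """
    y, t, ps, mu1, mu0 = map(np.asarray, (y, t, ps, mu1, mu0))
    treated = mu1 + t * (y - mu1) / ps
    control = mu0 + (1 - t) * (y - mu0) / (1 - ps)
    return float(np.mean(treated - control))


# Hypothetical data-generating process with one confounder x.
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
true_ps = 1.0 / (1.0 + np.exp(-0.5 * x))   # treatment depends on x
t = rng.binomial(1, true_ps)
y = 2.0 * t + x + rng.normal(size=n)       # true ATE is 2 by construction

# Unadjusted difference in means, biased because x confounds t and y.
naive = float(y[t == 1].mean() - y[t == 0].mean())

# Case 1: correct propensity score, misspecified outcome model (all zeros).
ate_ps_only = aipw_ate(y, t, true_ps, np.zeros(n), np.zeros(n))

# Case 2: misspecified propensity score (constant 0.5), correct outcome model.
ate_or_only = aipw_ate(y, t, np.full(n, 0.5), 2.0 + x, x)

print(f"naive difference in means: {naive:.3f}")
print(f"AIPW, correct PS only:     {ate_ps_only:.3f}")
print(f"AIPW, correct outcome only: {ate_or_only:.3f}")
```

In this toy setting both AIPW estimates should land close to the true effect of 2, while the unadjusted difference in means does not. In practice the nuisance models are estimated rather than known, which is the setting the paper addresses.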