Tang Lu, Zhou Ling, Song Peter X-K
Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA.
Center of Statistical Research, Southwestern University of Finance and Economics, Chengdu, Sichuan, China.
J Multivar Anal. 2020 Mar;176. doi: 10.1016/j.jmva.2019.104567. Epub 2019 Nov 28.
We propose a distributed method for simultaneous inference for datasets with sample size much larger than the number of covariates, i.e., n ≫ p, in the generalized linear models framework. When such datasets are too big to be analyzed entirely by a single centralized computer, or when datasets are already stored in distributed database systems, the strategy of divide-and-combine has been the method of choice for scalability. Due to data partitioning, the sub-dataset sample sizes may be uneven, with some possibly close to p, which calls for regularization techniques to improve numerical stability. However, there is a lack of clear theoretical justification and practical guidelines for combining results obtained from separate regularized estimators, especially when the final objective is simultaneous inference for a group of regression parameters. In this paper, we develop a strategy to combine bias-corrected lasso-type estimates by using confidence distributions. We show that the resulting combined estimator achieves the same estimation efficiency as the maximum likelihood estimator computed on the centralized data. As demonstrated by simulated and real data examples, our divide-and-combine method yields nearly identical inference to the centralized benchmark.
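The core divide-and-combine idea can be illustrated with a minimal sketch. Here we use ordinary least squares sub-estimators combined by inverse-variance (information) weighting, which is the simplest confidence-distribution-style combination rule; the paper itself works with bias-corrected lasso-type estimators in generalized linear models, so the partition count `K` and the toy linear model below are illustrative assumptions, not the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model with n >> p (the paper treats the broader GLM setting).
n, p, K = 10_000, 5, 10
X = rng.normal(size=(n, p))
beta = np.arange(1, p + 1, dtype=float)
y = X @ beta + rng.normal(size=n)

def fit_block(Xk, yk):
    """Per-partition estimator with its information matrix X'X."""
    info = Xk.T @ Xk                      # proportional to Fisher information
    est = np.linalg.solve(info, Xk.T @ yk)
    return est, info

# Divide: estimate separately on each partition.
blocks = [fit_block(Xk, yk)
          for Xk, yk in zip(np.array_split(X, K), np.array_split(y, K))]

# Combine: information-weighted average of the sub-estimates,
# i.e., solve (sum_k I_k) b = sum_k I_k b_k.
total_info = sum(info for _, info in blocks)
weighted_sum = sum(info @ est for est, info in blocks)
combined = np.linalg.solve(total_info, weighted_sum)

# Centralized benchmark using all the data at once.
central, _ = fit_block(X, y)

# For OLS this combination is algebraically exact, so the two agree
# up to floating-point error; with regularized sub-estimators the
# paper shows the combined estimator is asymptotically as efficient.
print(np.max(np.abs(combined - central)))
```

For least squares the combined estimator reproduces the centralized one exactly, since sum_k X_k'X_k β̂_k = sum_k X_k'y_k; the paper's contribution is establishing the analogous efficiency result when each block uses a bias-corrected regularized fit.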