Fang Kuangnan, Ma Shuangge
Department of Statistics, Xiamen University, Xiamen, Fujian, China.
Department of Biostatistics, Yale University, New Haven, CT, 06520, USA.
Biom J. 2017 Mar;59(2):358-376. doi: 10.1002/bimj.201600052. Epub 2016 Nov 21.
Data with a large p (number of covariates) and/or a large n (sample size) are now commonly encountered. For many problems, regularization, especially penalization, is adopted for estimation and variable selection. The straightforward application of penalization to large datasets demands a "big computer" with high computational power. To improve computational feasibility, we develop bootstrap penalization, which dissects a big penalized estimation into a set of small ones that can be executed in a highly parallel manner, each demanding only a "small computer". The proposed approach takes different strategies for data with different characteristics. For data with a large p but a small to moderate n, covariates are first clustered into relatively homogeneous blocks. The approach then proceeds in two sequential steps. In each step and for each bootstrap sample, we select blocks of covariates and run penalization. The results from multiple bootstrap samples are pooled to generate the final estimate. For data with a large n but a small to moderate p, we bootstrap a small number of subjects, apply penalized estimation, and then conduct a weighted average over multiple bootstrap samples. For data with a large p and a large n, a natural combination of the previous two methods is applied. Numerical studies, including simulations and data analysis, show that the proposed approach has computational and numerical advantages over the straightforward application of penalization. An R package has been developed to implement the proposed methods.
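The large-n, small-to-moderate-p strategy described above (bootstrap a small number of subjects, apply penalized estimation, take a weighted average) can be illustrated with a minimal Python sketch. This is not the authors' R package or their exact algorithm. The function names `lasso_cd` and `bootstrap_lasso`, the choice of Lasso as the penalty, the absence of an intercept (data assumed centered), and in particular the weighting rule (inverse mean squared error on the subjects not drawn) are all assumptions made for illustration; the abstract does not state how the weights are chosen.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=100):
    """Lasso via cyclic coordinate descent.

    Minimizes (1/2n) * ||y - X b||^2 + lam * ||b||_1 (no intercept).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # (1/n) * x_j' x_j for each column
    resid = y.astype(float)                # residual y - X @ beta at beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Correlation of column j with the partial residual
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding update for coordinate j
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


def bootstrap_lasso(X, y, lam, m, n_boot, seed=0):
    """Weighted average of Lasso fits on small bootstrap samples of size m.

    ASSUMPTION: weights are proportional to the inverse mean squared
    prediction error on subjects not drawn; the source does not specify
    the weighting scheme.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    betas = np.empty((n_boot, p))
    weights = np.empty(n_boot)
    # Each iteration is independent, so the loop could run in parallel.
    for b in range(n_boot):
        idx = rng.choice(n, size=m, replace=True)   # small bootstrap sample
        betas[b] = lasso_cd(X[idx], y[idx], lam)
        held_out = np.ones(n, dtype=bool)
        held_out[idx] = False                       # subjects not drawn
        mse = np.mean((y[held_out] - X[held_out] @ betas[b]) ** 2)
        weights[b] = 1.0 / (mse + 1e-12)
    weights /= weights.sum()
    return weights @ betas                          # weighted average, shape (p,)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 8))
    beta_true = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
    y = X @ beta_true + 0.5 * rng.normal(size=2000)
    print(bootstrap_lasso(X, y, lam=0.05, m=200, n_boot=20))
```

Each fit touches only `m` rows, so the per-task memory and time depend on `m` rather than `n`, which is the sense in which each piece needs only a "small computer". The values of `lam`, `m`, and `n_boot` here are arbitrary illustrative choices.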