Minnier Jessica, Tian Lu, Cai Tianxi
Ph.D. candidate, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115.
J Am Stat Assoc. 2011 Jan 1;106(496):1371-1382. doi: 10.1198/jasa.2011.tm10382. Epub 2012 Jan 24.
Analysis of high-dimensional data often seeks to identify a subset of important features and assess their effects on the outcome. Traditional statistical inference procedures based on standard regression methods often fail in the presence of high-dimensional features. In recent years, regularization methods have emerged as promising tools for analyzing high-dimensional data. These methods simultaneously select important features and provide stable estimation of their effects. Adaptive LASSO and SCAD, for instance, give consistent and asymptotically normal estimates with oracle properties. However, in finite samples, it remains difficult to obtain interval estimators for the regression parameters. In this paper, we propose perturbation-resampling-based procedures to approximate the distribution of a general class of penalized parameter estimates. Our proposal, justified by asymptotic theory, provides a simple way to estimate the covariance matrix and confidence regions. Through finite-sample simulations, we verify the ability of this method to give accurate inference and compare it to other widely used standard deviation and confidence interval estimates. We also illustrate our proposals with a data set used to study the association between HIV drug resistance and a large number of genetic mutations.
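The idea of approximating the distribution of a penalized estimate by perturbation resampling can be illustrated with a minimal sketch. The abstract does not give the algorithm, so everything below is an assumption made for illustration: a linear model, an adaptive LASSO objective fitted by plain coordinate descent, independent Exp(1) perturbation weights (mean 1, variance 1), re-estimation of the initial least-squares fit under each set of weights, and simple percentile intervals. The function names (`weighted_ols`, `adaptive_lasso`, `perturbation_inference`) and the tuning value `lam` are hypothetical, and the interval construction here is not necessarily the one proposed in the paper.

```python
import numpy as np


def weighted_ols(X, y, w):
    """Weighted least squares: argmin_b sum_i w_i * (y_i - x_i'b)^2."""
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)


def adaptive_lasso(X, y, w, lam, n_iter=100):
    """Coordinate descent for the weighted adaptive LASSO objective
    0.5 * sum_i w_i * (y_i - x_i'b)^2 + lam * sum_j |b_j| / |b_init_j|,
    where b_init is the weighted least-squares fit (assumed initial estimator)."""
    b_init = weighted_ols(X, y, w)
    penalty = lam / np.abs(b_init)               # coefficient-specific penalties
    col_norm = (w[:, None] * X ** 2).sum(axis=0)
    b = b_init.copy()
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            # Partial residual that excludes the contribution of feature j.
            r_j = y - X @ b + X[:, j] * b[j]
            rho = np.sum(w * X[:, j] * r_j)
            # Soft-thresholding update for coordinate j.
            b[j] = np.sign(rho) * max(abs(rho) - penalty[j], 0.0) / col_norm[j]
    return b


def perturbation_inference(X, y, lam, n_resamples=200, seed=0):
    """Refit the estimator under random positive weights and summarize the draws."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta_hat = adaptive_lasso(X, y, np.ones(n), lam)
    draws = np.empty((n_resamples, p))
    for b in range(n_resamples):
        g = rng.exponential(1.0, size=n)         # perturbation weights, mean 1, variance 1
        draws[b] = adaptive_lasso(X, y, g, lam)  # perturbs both loss and initial estimate
    se = draws.std(axis=0, ddof=1)               # resampling-based standard errors
    lower, upper = np.percentile(draws, [2.5, 97.5], axis=0)
    return beta_hat, se, lower, upper


# Simulated example (hypothetical data, not the HIV data set from the paper).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ beta_true + rng.normal(size=200)

beta_hat, se, lower, upper = perturbation_inference(X, y, lam=2.0)
for j in range(X.shape[1]):
    print(f"beta[{j}] = {beta_hat[j]: .3f}  se = {se[j]:.3f}  "
          f"95% interval = ({lower[j]: .3f}, {upper[j]: .3f})")
```

Unlike the nonparametric bootstrap, which resamples observations with replacement, this scheme keeps every observation and only reweights it, so each refit is a smooth perturbation of the original problem. In practice `lam` would be chosen by a data-driven criterion rather than fixed in advance.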