Chen Ting-Huei, Sun Wei, Fine Jason P
Department of Mathematics and Statistics, Laval University, Quebec, QC G1V0A6, Canada.
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA.
Electron J Stat. 2016;10(2):2312-2328. doi: 10.1214/16-EJS1169. Epub 2016 Aug 29.
Various forms of penalty functions have been developed for regularized estimation and variable selection. Screening approaches are often used to reduce the number of covariates before penalized estimation. However, in certain problems, the number of covariates remains large after screening. For example, in genome-wide association (GWA) studies, the purpose is to identify single nucleotide polymorphisms (SNPs) that are associated with certain traits, and typically there are millions of SNPs and thousands of samples. Because nearby SNPs are strongly correlated, screening can only reduce the number of SNPs from millions to tens of thousands, and the variable selection problem remains very challenging. Several penalty functions have been proposed for such high-dimensional data. However, it is unclear which class of penalty functions is the appropriate choice for a particular application. In this paper, we conduct a theoretical analysis that relates the ranges of the tuning parameters of various penalty functions to the dimensionality of the problem and the minimum effect size. We illustrate our theoretical results with several penalty functions. The results suggest that a class of penalty functions that bridges the L0 and L1 penalties requires less restrictive conditions on dimensionality and minimum effect size in order to attain the two fundamental goals of penalized estimation: to penalize all noise coefficients to zero and to obtain unbiased estimates of the true signals. Penalties such as SICA and Log belong to this class, but they have not often been used in applications. Simulations and a real data analysis using GWAS data suggest that this class of penalties is promising in applications.
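To make the idea of a penalty that bridges L0 and L1 concrete, the following is a minimal sketch of the SICA penalty in its commonly stated form, p(beta) = lam * (a + 1) * |beta| / (a + |beta|) with shape parameter a > 0. The function names and the parameter names `lam` and `a` are illustrative assumptions, and the parameterization used in the paper itself may differ. As `a` approaches 0 the penalty approaches the L0 penalty, and as `a` grows large it approaches the L1 penalty.

```python
def sica_penalty(beta, lam, a):
    """SICA penalty in its commonly stated form (assumed parameterization).

    lam is the overall tuning parameter; a > 0 controls the shape.
    """
    if a <= 0:
        raise ValueError("a must be positive")
    t = abs(beta)
    return lam * (a + 1.0) * t / (a + t)


def l0_penalty(beta, lam):
    """L0 penalty: a constant cost lam for any nonzero coefficient."""
    return lam if beta != 0 else 0.0


def l1_penalty(beta, lam):
    """L1 (lasso) penalty: cost proportional to |beta|."""
    return lam * abs(beta)


if __name__ == "__main__":
    beta, lam = 0.5, 1.0
    # Small a: SICA is close to the L0 penalty.
    print(sica_penalty(beta, lam, a=1e-8), l0_penalty(beta, lam))
    # Large a: SICA is close to the L1 penalty.
    print(sica_penalty(beta, lam, a=1e8), l1_penalty(beta, lam))
```

In this form the shape parameter `a` is a second tuning parameter alongside `lam`, which is the kind of tuning-parameter range the analysis relates to dimensionality and minimum effect size.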