Pan Wei, Xie Benhuai, Shen Xiaotong
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455, USA.
Biometrics. 2010 Jun;66(2):474-84. doi: 10.1111/j.1541-0420.2009.01296.x. Epub 2009 Jul 23.
We consider penalized linear regression, especially for "large p, small n" problems, in which the relationships among predictors are described a priori by a network. A motivating class of examples is modeling a phenotype from gene expression profiles while accounting for the coordinated functioning of genes in biological pathways or networks. To incorporate the prior knowledge that neighboring predictors in a network have similar effect sizes, we propose a grouped penalty based on the L(gamma)-norm that smooths the regression coefficients of the predictors over the network. The main feature of the proposed method is its ability to automatically realize grouped variable selection and exploit grouping effects. We also discuss the effects of the choice of gamma and of the weights inside the L(gamma)-norm. Simulation studies demonstrate the superior finite-sample performance of the proposed method compared with Lasso, the elastic net, and a recently proposed network-based method. The new method performs best in variable selection across all simulation set-ups considered. For illustration, the method is applied to predict survival times for a set of glioblastoma patients using a microarray gene expression dataset and a gene network compiled from some Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways.
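To make the idea of a network-based grouped penalty concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the penalty formula, so the form used here is an assumption: a sum, over the edges (i, j) of the network, of a weighted L(gamma)-norm of the coefficient pair (beta_i, beta_j). The function name `network_lgamma_penalty`, its parameters, and the default unit weights are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def network_lgamma_penalty(
    beta: Sequence[float],
    edges: Sequence[Tuple[int, int]],
    gamma: float = 2.0,
    weights: Optional[Sequence[float]] = None,
    lam: float = 1.0,
) -> float:
    """Assumed penalty form (not taken from the abstract):

        lam * sum over edges (i, j) of
            (|beta_i|^gamma / w_i + |beta_j|^gamma / w_j)^(1 / gamma)

    Each edge contributes a weighted L(gamma)-norm of the two linked
    coefficients, so linked predictors are penalized as a group.
    """
    if gamma < 1.0:
        raise ValueError("gamma must be at least 1 for a norm")
    if weights is None:
        # Assumption: unit weights when none are supplied.
        w: List[float] = [1.0] * len(beta)
    else:
        w = list(weights)
    if any(wi <= 0.0 for wi in w):
        raise ValueError("weights must be positive")

    total = 0.0
    for i, j in edges:
        pair = abs(beta[i]) ** gamma / w[i] + abs(beta[j]) ** gamma / w[j]
        total += pair ** (1.0 / gamma)
    return lam * total


# Toy network: a chain of three predictors, 0 - 1 - 2.
beta = [3.0, 4.0, 0.0]
edges = [(0, 1), (1, 2)]
penalty_l2 = network_lgamma_penalty(beta, edges, gamma=2.0)
penalty_l1 = network_lgamma_penalty(beta, edges, gamma=1.0)
```

Under this assumed form, gamma = 1 reduces each edge term to a sum of absolute values, as in the Lasso, while gamma > 1 couples the two coefficients on an edge so that the term vanishes only when both are zero. That coupling is one way to obtain the grouped selection behavior the abstract describes.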