School of Public Health, Yale University, New Haven, CT 06520, USA.
Biostatistics. 2013 Apr;14(2):205-19. doi: 10.1093/biostatistics/kxs034. Epub 2012 Sep 17.
In genome-wide association studies, penalization is an important approach for identifying genetic markers associated with disease. Single nucleotide polymorphisms have a natural grouping structure and, more importantly, these groups are correlated. Motivated by this, we propose a new penalization method for group variable selection that properly accommodates the correlation between adjacent groups. The method combines the group Lasso penalty with a quadratic penalty on the difference of regression coefficients of adjacent groups. We refer to the new method as smoothed group Lasso (SGL). It encourages group sparsity and smooths regression coefficients for adjacent groups. The weights between groups in the quadratic difference penalty are based on canonical correlations. We first derive a group coordinate descent (GCD) algorithm for computing the solution path under the linear regression model. The SGL method is then extended to logistic regression for binary responses. With the assistance of the majorization-minimization algorithm, SGL penalized logistic regression becomes an iteratively penalized least-squares problem. We also suggest conducting principal component analysis to reduce the dimensionality within groups. Simulation studies are used to evaluate the finite sample performance. Comparison with group Lasso shows that SGL is more effective in selecting true positives. Two datasets are analyzed using the SGL method.
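To make the structure of the penalty concrete, the following is a minimal Python sketch of a group Lasso term combined with a weighted quadratic smoothing term on adjacent groups. The abstract does not give the exact formula, so the scaling by group size, the factor of one half, and the names `sgl_penalty`, `lam1`, `lam2`, and `weights` are all assumptions made for illustration. The weights stand in for the canonical-correlation-based weights mentioned above and are simply supplied by hand here.

```python
import math


def group_norm(beta, idx):
    """Euclidean norm of the coefficients belonging to one group."""
    return math.sqrt(sum(beta[i] ** 2 for i in idx))


def sgl_penalty(beta, groups, lam1, lam2, weights):
    """Group Lasso term plus a weighted quadratic smoothing term.

    Assumed form (an illustration, not taken from the abstract):
      lam1 * sum_j sqrt(d_j) * ||beta_j||
      + (lam2 / 2) * sum_j w_j * (||beta_j|| / sqrt(d_j)
                                  - ||beta_{j+1}|| / sqrt(d_{j+1})) ** 2
    where d_j is the size of group j and w_j weights the pair (j, j+1).
    """
    norms = [group_norm(beta, g) for g in groups]
    sizes = [len(g) for g in groups]

    # Group sparsity term: each group norm is scaled by sqrt(group size).
    lasso = lam1 * sum(math.sqrt(d) * n for d, n in zip(sizes, norms))

    # Smoothing term: penalize differences between size-normalized
    # norms of adjacent groups, one weight per adjacent pair.
    scaled = [n / math.sqrt(d) for d, n in zip(sizes, norms)]
    smooth = 0.5 * lam2 * sum(
        w * (a - b) ** 2 for w, a, b in zip(weights, scaled, scaled[1:])
    )
    return lasso + smooth


# Toy example: three adjacent groups of sizes 2, 2, and 1.
beta = [3.0, 4.0, 0.0, 0.0, 1.0]
groups = [[0, 1], [2, 3], [4]]
weights = [0.5, 0.2]  # hypothetical weights for pairs (1,2) and (2,3)
penalty = sgl_penalty(beta, groups, lam1=1.0, lam2=2.0, weights=weights)
```

Setting `lam2=0` in this sketch recovers a plain group Lasso penalty, which is the baseline the abstract compares against.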