Institute of Human Genetics, Central Parkway, Newcastle upon Tyne, United Kingdom.
Genet Epidemiol. 2010 Dec;34(8):879-91. doi: 10.1002/gepi.20543.
Penalized regression methods offer an attractive alternative to single marker testing in genetic association analysis. They shrink to zero the coefficients of markers that have little apparent effect on the trait of interest, leaving a parsimonious subset of what we hope are the truly pertinent predictors. Here we explore the performance of penalization in selecting SNPs as predictors in genetic association studies. The strength of the penalty can be chosen to give a good predictive model (for example, by computationally expensive cross-validation), by a maximum likelihood-based model selection criterion (such as the BIC), or to give a model that controls the type I error, as done here. We investigated the performance of several penalized logistic regression approaches, simulating data under a variety of disease locus effect sizes and linkage disequilibrium patterns. We compared several penalties, including the elastic net, ridge, Lasso, MCP and the normal-exponential-γ shrinkage prior implemented in the hyperlasso software, with standard single locus analysis and simple forward stepwise regression. We examined how markers enter the model as penalties and P-value thresholds are varied, and report the sensitivity and specificity of each method. Results show that penalized methods outperform single marker analysis. The main difference is that penalized methods allow several markers to be included simultaneously and generally do not allow correlated variables to enter the model, producing a sparse model in which most of the identified explanatory markers are accounted for.
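To make concrete how markers enter a model as the penalty is varied, the following is a minimal sketch of Lasso-penalized logistic regression on simulated SNP data. It is an illustration only, not the software or simulation design used in the study: the proximal gradient solver, the function names (`soft_threshold`, `lasso_logistic`), the sample size, the allele frequency, the effect sizes and the absence of linkage disequilibrium between markers are all assumptions chosen for brevity.

```python
import numpy as np

def soft_threshold(z, t):
    """Lasso proximal operator: shrink each entry toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_logistic(X, y, lam, n_iter=2000):
    """L1-penalized logistic regression by proximal gradient descent.

    Minimizes mean logistic loss + lam * ||beta||_1 (intercept unpenalized).
    """
    n, p = X.shape
    beta, intercept = np.zeros(p), 0.0
    # Step size 1/L, with L an upper bound on the curvature of the mean
    # logistic loss (the "+ n" term covers the intercept column of ones).
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(intercept + X @ beta)))
        resid = prob - y
        intercept -= step * resid.mean()
        beta = soft_threshold(beta - step * (X.T @ resid) / n, step * lam)
    return intercept, beta

# Hypothetical simulation: 500 individuals, 20 independent SNPs (no LD),
# minor allele frequency 0.3, two causal markers (indices 0 and 5).
rng = np.random.default_rng(0)
n, p = 500, 20
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)  # 0/1/2 allele counts
X = (G - G.mean(axis=0)) / G.std(axis=0)             # standardize each SNP
true_beta = np.zeros(p)
true_beta[[0, 5]] = 0.8
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

# Smallest penalty at which every coefficient is zero (for centered X).
lam_max = np.max(np.abs(X.T @ (y - y.mean()))) / n

# Relax the penalty and report which markers have entered the model.
for frac in (1.1, 0.5, 0.25, 0.1):
    _, beta = lasso_logistic(X, y, frac * lam_max)
    print(f"lambda = {frac:.2f} * lam_max -> markers: {np.flatnonzero(beta).tolist()}")
```

In this sketch the penalty is simply scanned over a grid. Choosing it to control the type I error, as the abstract describes, would require an additional calibration step that is not shown here.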