Department of Epidemiology and Biostatistics, Imperial College London, London, United Kingdom; Statistical Consulting Group, GlaxoSmithKline, Stevenage, United Kingdom.
Genet Epidemiol. 2013 Nov;37(7):704-14. doi: 10.1002/gepi.21750. Epub 2013 Jul 26.
To date, numerous genetic variants have been identified as associated with diverse phenotypic traits. However, identified associations generally explain only a small proportion of trait heritability and the predictive power of models incorporating only known-associated variants has been small. Multiple regression is a popular framework in which to consider the joint effect of many genetic variants simultaneously. Ordinary multiple regression is seldom appropriate in the context of genetic data, due to the high dimensionality of the data and the correlation structure among the predictors. There has been a resurgence of interest in the use of penalised regression techniques to circumvent these difficulties. In this paper, we focus on ridge regression, a penalised regression approach that has been shown to offer good performance in multivariate prediction problems. One challenge in the application of ridge regression is the choice of the ridge parameter that controls the amount of shrinkage of the regression coefficients. We present a method to determine the ridge parameter based on the data, with the aim of good performance in high-dimensional prediction problems. We establish a theoretical justification for our approach, and demonstrate its performance on simulated genetic data and on a real data example. Fitting a ridge regression model to hundreds of thousands to millions of genetic variants simultaneously presents computational challenges. We have developed an R package, ridge, which addresses these issues. Ridge implements the automatic choice of ridge parameter presented in this paper, and is freely available from CRAN.
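As an illustration of the shrinkage that the ridge parameter controls, the following is a minimal sketch in Python with NumPy. It is not the automatic parameter choice proposed in the paper, nor the interface of the ridge R package. The function name `ridge_fit`, the simulated data, and the grid of penalty values `lambdas` are all assumptions made for this example. It computes the standard ridge estimate through the singular value decomposition of the predictor matrix, which remains well defined when there are more predictors than observations.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge coefficients via the thin SVD of X.

    Equivalent to solving (X^T X + lam * I) beta = X^T y for lam > 0,
    but avoids forming the p x p matrix explicitly.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = d / (d**2 + lam)          # per-component shrinkage factors
    return Vt.T @ (shrink * (U.T @ y))


# Hypothetical simulated data with more predictors than observations.
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                    # centre the predictors
beta_true = np.zeros(p)
beta_true[:5] = 1.0                    # only five predictors have an effect
y = X @ beta_true + rng.standard_normal(n)
y -= y.mean()                          # centre the response

# Arbitrary grid of ridge parameters, chosen only for illustration.
lambdas = [0.1, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lambdas]

for lam, norm in zip(lambdas, norms):
    print(f"lambda = {lam:6.1f}   ||beta|| = {norm:.4f}")
```

Larger values of the ridge parameter pull the coefficient vector more strongly towards zero, which is why the choice of this parameter matters for prediction. In this sketch the grid is fixed by hand, whereas the paper describes a way of determining the parameter from the data.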