Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London, UK.
BMC Bioinformatics. 2011 Sep 19;12:372. doi: 10.1186/1471-2105-12-372.
Technological developments have increased the feasibility of large-scale genetic association studies. Densely typed genetic markers are obtained using SNP arrays, next-generation sequencing technologies and imputation. However, SNPs typed using these methods can be highly correlated due to linkage disequilibrium among them, and standard multiple regression techniques fail with these data sets due to their high dimensionality and correlation structure. There has been increasing interest in using penalised regression in the analysis of high-dimensional data. Ridge regression is one such penalised regression technique; it does not perform variable selection, instead estimating a regression coefficient for each predictor variable. It is therefore desirable to obtain an estimate of the significance of each ridge regression coefficient.
We develop and evaluate a test of significance for ridge regression coefficients. Using simulation studies, we demonstrate that the performance of the test is comparable to that of a permutation test, with the advantage of a much-reduced computational cost. We introduce the p-value trace, a plot of the negative logarithm of the p-value of each ridge regression coefficient against the shrinkage parameter, which shows how the p-values change with increasing penalisation. We apply the proposed method to a lung cancer case-control data set from EPIC, the European Prospective Investigation into Cancer and Nutrition.
The proposed test is a useful alternative to a permutation test for the estimation of the significance of ridge regression coefficients, at a much-reduced computational cost. The p-value trace is an informative graphical tool for evaluating the results of a test of significance of ridge regression coefficients as the shrinkage parameter increases, and the proposed test makes its production computationally feasible.
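As an illustration of the ideas summarised above, the following is a minimal sketch of a Wald-type significance test for ridge coefficients and of the values that a p-value trace would plot. The abstract does not state the test statistic, so everything here is an assumption made for illustration: the function names `ridge_pvalues` and `p_value_trace`, a continuous outcome with a linear model, the sandwich form of the coefficient covariance, the residual degrees of freedom `n - 1 - trace(H)`, and the normal reference distribution. The variance estimate requires `n - 1 - trace(H)` to be positive, which can fail when there are more predictors than samples and the shrinkage parameter is small. The data are simulated and are not the EPIC data.

```python
import math
import numpy as np

def ridge_pvalues(X, y, lam):
    """Ridge coefficients and two-sided Wald-type p-values (assumed form)."""
    n, p = X.shape
    # Standardise predictors and centre the response, so no intercept is fitted.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    XtX = Xs.T @ Xs
    A = np.linalg.inv(XtX + lam * np.eye(p))   # (X'X + lam I)^-1
    beta = A @ Xs.T @ yc                       # ridge estimate
    # Effective degrees of freedom: trace of the hat matrix Xs A Xs'.
    df_model = np.trace(A @ XtX)
    resid = yc - Xs @ beta
    sigma2 = resid @ resid / (n - 1 - df_model)  # assumed variance estimate
    cov = sigma2 * A @ XtX @ A                   # covariance of the ridge estimator
    z = beta / np.sqrt(np.diag(cov))
    # Two-sided p-value under a standard normal reference distribution.
    pvals = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in z])
    return beta, pvals

def p_value_trace(X, y, lams):
    """Rows: shrinkage values; columns: -log10(p) for each predictor."""
    return np.array([-np.log10(ridge_pvalues(X, y, lam)[1]) for lam in lams])

# Simulated example: five correlated predictors, only the first has an effect.
rng = np.random.default_rng(0)
n, p = 200, 5
shared = rng.normal(size=(n, 1))
X = 0.7 * shared + rng.normal(size=(n, p))     # induces correlation between columns
y = 0.5 * X[:, 0] + rng.normal(size=n)
lams = np.logspace(-2, 3, 20)
trace = p_value_trace(X, y, lams)
```

Plotting each column of `trace` against `lams` on a logarithmic axis gives one line per predictor. Because each point needs only one matrix inversion rather than many refits on permuted data, the whole grid of shrinkage values is cheap to evaluate, which is the computational advantage the abstract refers to.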