Klinkhammer Hannah, Staerk Christian, Maj Carlo, Krawitz Peter Michael, Mayr Andreas
Institute for Medical Biometry, Informatics and Epidemiology, Medical Faculty, University of Bonn, Bonn, Germany.
Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University of Bonn, Bonn, Germany.
Front Genet. 2023 Jan 10;13:1076440. doi: 10.3389/fgene.2022.1076440. eCollection 2022.
Polygenic risk scores (PRS) evaluate the individual genetic liability to a certain trait and are expected to play an increasingly important role in clinical risk stratification. Most often, PRS are estimated based on summary statistics of univariate effects derived from genome-wide association studies. To improve the predictive performance of PRS, it is desirable to fit multivariable models directly on the genetic data. Because these data are large and high-dimensional, a direct application of existing methods is often not feasible, and new efficient algorithms are required to overcome the computational burden in terms of run time and memory demands. We develop an adapted component-wise L2-boosting algorithm to fit genotype data from large cohort studies to continuous outcomes using linear base-learners for the genetic variants. Similar to the snpnet approach implementing lasso regression, the proposed snpboost approach iteratively works on smaller batches of variants. By restricting the set of possible base-learners in each boosting step to the variants most correlated with the residuals from previous iterations, the computational efficiency can be substantially increased without losing prediction accuracy. Furthermore, for large-scale data based on various traits from the UK Biobank, we show that our method yields competitive prediction accuracy and computational efficiency compared to the snpnet approach and further commonly used methods. Due to the modular structure of boosting, our framework can be further extended to construct PRS for different outcome data and effect types; we illustrate this for the prediction of binary traits.
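The batch-restricted boosting idea described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the snpboost implementation: the function name `batch_l2_boost` and the parameters `batch_size`, `steps_per_batch` and `nu` (the step length) are hypothetical, the genotype matrix is assumed to fit in memory, and every variant is assumed to have non-zero variance.

```python
import numpy as np


def batch_l2_boost(X, y, n_steps=200, batch_size=10, steps_per_batch=5, nu=0.1):
    """Component-wise L2-boosting on batches of variants (illustrative sketch).

    Returns the intercept and the coefficients on the standardized genotype scale.
    All parameter names are hypothetical and chosen for this example only.
    """
    n, p = X.shape
    # Standardize each variant (assumes no constant columns), so x_j' x_j = n.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    intercept = y.mean()
    beta = np.zeros(p)
    resid = y - intercept
    batch = np.arange(min(batch_size, p))
    for m in range(n_steps):
        if m % steps_per_batch == 0:
            # Refresh the batch: keep the variants most correlated
            # (in absolute value) with the current residuals.
            score = np.abs(Xs.T @ resid)
            batch = np.argsort(score)[-batch_size:]
        # Fit one linear base-learner per variant in the batch.
        # With standardized columns the least-squares slope is x_j' r / n.
        coefs = Xs[:, batch].T @ resid / n
        # The base-learner with the largest |slope| reduces the residual
        # sum of squares the most, so only that variant is updated.
        k = np.argmax(np.abs(coefs))
        j = batch[k]
        beta[j] += nu * coefs[k]
        resid = resid - nu * coefs[k] * Xs[:, j]
    return intercept, beta


# Usage on simulated genotypes (0/1/2 allele counts); the data are synthetic.
rng = np.random.default_rng(0)
X_demo = rng.binomial(2, 0.3, size=(500, 50)).astype(float)
y_demo = 1.0 * X_demo[:, 3] - 0.8 * X_demo[:, 17] + rng.normal(0, 0.5, size=500)
intercept_demo, beta_demo = batch_l2_boost(X_demo, y_demo)
print(np.flatnonzero(beta_demo))  # indices of the variants selected by boosting
```

Searching only within the current batch means each boosting step touches `batch_size` columns rather than all variants. The full scan over all variants happens only when the batch is refreshed.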