Efficient Implementation of Penalized Regression for Genetic Risk Prediction.

Author information

Laboratoire TIMC-IMAG, UMR 5525, University of Grenoble Alpes, CNRS, 38700 La Tronche, France.

Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI), Institut Pasteur, 75015 Paris, France.

Publication information

Genetics. 2019 May;212(1):65-74. doi: 10.1534/genetics.119.302019. Epub 2019 Feb 26.

Abstract

Polygenic Risk Scores (PRS) combine genotype information across many single-nucleotide polymorphisms (SNPs) to give a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify individuals at high genetic risk for a given disease. The "Clumping+Thresholding" (C+T) approach is the most common method to derive PRS. C+T uses only univariate genome-wide association studies (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects for computing PRS has the potential to significantly improve the predictive performance of PRS compared with C+T. In this paper, we present an efficient method for the joint estimation of SNP effects using individual-level data, allowing for practical application of penalized logistic regression (PLR) on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of PLR directly includes automatic choices for hyper-parameters. We also provide an implementation of penalized linear regression for quantitative traits. We compare the performance of PLR, C+T and a derivation of random forests using both real and simulated data. Overall, we find that PLR achieves equal or higher predictive performance than C+T in most scenarios considered, while being scalable to biobank data. In particular, we find that the improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; for instance, in simulations, AUC values increase from 83% with the best prediction of C+T to 92.5% with PLR. We confirm these results in a data analysis of a case-control study for celiac disease, where PLR and the standard C+T method achieve AUC values of 89% and 82.5%, respectively. Applying penalized linear regression to 350,000 individuals of the UK Biobank, we predict height with a larger correlation than with the best prediction of C+T (∼65% instead of ∼55%), further demonstrating its scalability and strong predictive power, even for highly polygenic traits. Moreover, using 150,000 individuals of the UK Biobank, we are able to predict breast cancer better than C+T, fitting PLR in only a few minutes. In conclusion, this paper demonstrates the feasibility and relevance of using penalized regression for PRS computation when large individual-level datasets are available, thanks to the efficient implementation available in our R package bigstatsr.
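
To make the idea of jointly estimated SNP effects concrete, the following is a minimal sketch in plain Python, not the bigstatsr implementation. It assumes a lasso (L1) penalty fitted by proximal gradient descent, a fixed penalty `lam` rather than the automatic hyper-parameter choice described in the abstract, and a small simulated genotype matrix. All function names (`fit_l1_logistic`, `prs`, `auc`, `demo`) and all simulation settings are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward zero by t."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def prs(G, beta):
    """Polygenic score per individual: weighted sum of allele counts."""
    return [sum(b * g for b, g in zip(beta, row)) for row in G]


def fit_l1_logistic(G, y, lam, n_iter=300):
    """L1-penalized logistic regression by proximal gradient descent.

    G: rows of allele counts (0/1/2); y: 0/1 labels.
    The intercept is left unpenalized.
    """
    n, m = len(G), len(G[0])
    # Safe step size: max_i (1 + ||g_i||^2) / 4 bounds the Lipschitz
    # constant of the gradient of the mean logistic loss.
    step = 4.0 / max(1.0 + sum(g * g for g in row) for row in G)
    beta0, beta = 0.0, [0.0] * m
    for _ in range(n_iter):
        # Residuals are computed once per iteration, so all coefficients
        # are updated from the same gradient (a simultaneous update).
        resid = [sigmoid(beta0 + s) - yi for s, yi in zip(prs(G, beta), y)]
        beta0 -= step * sum(resid) / n
        for j in range(m):
            grad_j = sum(r * row[j] for r, row in zip(resid, G)) / n
            beta[j] = soft_threshold(beta[j] - step * grad_j, step * lam)
    return beta0, beta


def auc(scores, y):
    """AUC as the fraction of case/control pairs ranked correctly."""
    cases = [s for s, yi in zip(scores, y) if yi == 1]
    controls = [s for s, yi in zip(scores, y) if yi == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def demo(seed=1, n=400, m=10, maf=0.3):
    """Simulate genotypes with two causal SNPs; return the test-set AUC."""
    rng = random.Random(seed)
    true_beta = [1.5, 1.5] + [0.0] * (m - 2)  # assumed effect sizes
    G = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(m)]
         for _ in range(n)]
    y = [int(rng.random() < sigmoid(s - 1.8)) for s in prs(G, true_beta)]
    half = n // 2  # first half for training, second half for testing
    _, beta = fit_l1_logistic(G[:half], y[:half], lam=0.02)
    return auc(prs(G[half:], beta), y[half:])


if __name__ == "__main__":
    print("test AUC:", demo())
```

In this sketch the PRS is the fitted linear predictor without the intercept, and the L1 penalty sets many coefficients exactly to zero. A pure-Python loop like this is only practical for toy data; it is not a stand-in for an implementation that scales to biobank-sized datasets.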


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/667e/6499521/5d04d37cfaa2/65f1.jpg
