Efficient Implementation of Penalized Regression for Genetic Risk Prediction.

Affiliations

Laboratoire TIMC-IMAG, UMR 5525, University of Grenoble Alpes, CNRS, 38700 La Tronche, France

Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI), Institut Pasteur, 75015 Paris, France.

Publication information

Genetics. 2019 May;212(1):65-74. doi: 10.1534/genetics.119.302019. Epub 2019 Feb 26.

DOI:10.1534/genetics.119.302019

PMID:30808621

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6499521/

Abstract

Polygenic Risk Scores (PRS) combine genotype information across many single-nucleotide polymorphisms (SNPs) to give a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify high-genetic risk individuals for a given disease. The "Clumping+Thresholding" (C+T) approach is the most common method to derive PRS. C+T uses only univariate genome-wide association studies (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects for computing PRS has the potential to significantly improve the predictive performance of PRS as compared to C+T. In this paper, we present an efficient method for the joint estimation of SNP effects using individual-level data, allowing for practical application of penalized logistic regression (PLR) on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of PLR directly includes automatic choices for hyper-parameters. We also provide an implementation of penalized linear regression for quantitative traits. We compare the performance of PLR, C+T and a derivation of random forests using both real and simulated data. Overall, we find that PLR achieves equal or higher predictive performance than C+T in most scenarios considered, while being scalable to biobank data. In particular, we find that improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; for instance, in simulations, AUC values increase from 83% with the best prediction of C+T to 92.5% with PLR. We confirm these results in a data analysis of a case-control study for celiac disease where PLR and the standard C+T method achieve AUC values of 89% and of 82.5%. 
Applying penalized linear regression to 350,000 individuals of the UK Biobank, we predict height with a larger correlation than with the best prediction of C+T (∼65% instead of ∼55%), further demonstrating its scalability and strong predictive power, even for highly polygenic traits. Moreover, using 150,000 individuals of the UK Biobank, we are able to predict breast cancer better than C+T, fitting PLR in a few minutes only. In conclusion, this paper demonstrates the feasibility and relevance of using penalized regression for PRS computation when large individual-level datasets are available, thanks to the efficient implementation available in our R package bigstatsr.
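
The idea of jointly estimating SNP effects with a penalized logistic regression, then scoring individuals and measuring AUC, can be illustrated with a minimal sketch. This is not the bigstatsr implementation: it assumes a plain lasso (L1) penalty fitted by proximal gradient descent, a fixed penalty `lam` rather than an automatic choice of hyper-parameters, and a tiny made-up genotype matrix. All function names, parameters, and data below are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(x, t):
    # Proximal operator of the L1 penalty: shrink x toward zero by t.
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def fit_l1_logistic(G, y, lam=0.01, step=0.1, n_iter=200):
    """Minimize mean logistic loss + lam * sum(|beta_j|) by proximal gradient.

    G is a list of rows of allele counts (0/1/2), y a list of 0/1 labels.
    All SNP effects are updated jointly from the same residuals.
    """
    n, m = len(G), len(G[0])
    beta, intercept = [0.0] * m, 0.0
    for _ in range(n_iter):
        # Residuals (predicted probability minus label) under the current model.
        resid = [
            sigmoid(intercept + sum(b * g for b, g in zip(beta, row))) - yi
            for row, yi in zip(G, y)
        ]
        intercept -= step * sum(resid) / n  # the intercept is not penalized
        for j in range(m):
            grad_j = sum(r * row[j] for r, row in zip(resid, G)) / n
            beta[j] = soft_threshold(beta[j] - step * grad_j, step * lam)
    return intercept, beta


def prs(G, beta):
    # Polygenic score: weighted sum of allele counts for each individual.
    return [sum(b * g for b, g in zip(beta, row)) for row in G]


def auc(scores, labels):
    # Probability that a random case outranks a random control (ties count 1/2).
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if c > d else 0.5 if c == d else 0.0
        for c in cases
        for d in controls
    )
    return wins / (len(cases) * len(controls))


# Toy data (hypothetical): 4 individuals x 2 SNPs; only the first SNP tracks status.
G_toy = [[0, 1], [0, 0], [2, 1], [2, 0]]
y_toy = [0, 0, 1, 1]

intercept, beta = fit_l1_logistic(G_toy, y_toy)
print("estimated effects:", beta)
print("training AUC:", auc(prs(G_toy, beta), y_toy))
```

In practice the AUC would be computed on held-out individuals rather than on the training set, and the penalty would be tuned rather than fixed. A sufficiently large `lam` sets every effect to zero, which is what makes the lasso penalty act as a variable selector.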

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/667e/6499521/5d04d37cfaa2/65f1.jpg
