Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02215.
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20814.
Proc Natl Acad Sci U S A. 2024 Aug 13;121(33):e2403210121. doi: 10.1073/pnas.2403210121. Epub 2024 Aug 7.
Polygenic risk scores (PRS) enhance population risk stratification and advance personalized medicine, but existing methods are limited by computational burden, predictive accuracy, and adaptability to a wide range of genetic architectures. To address these issues, we propose Aggregated L0Learn using Summary-level data (ALL-Sum), a fast and scalable ensemble learning method for computing PRS using summary statistics from genome-wide association studies (GWAS). ALL-Sum leverages an L0L2 penalized regression and ensemble learning across tuning parameters to flexibly model traits with diverse genetic architectures. In extensive large-scale simulations across a wide range of polygenicity and GWAS sample sizes, ALL-Sum consistently outperformed popular alternative methods in terms of prediction accuracy, runtime, and memory usage by 10%, 20-fold, and threefold, respectively, and demonstrated robustness to diverse genetic architectures. We validated the performance of ALL-Sum in real data analysis of 11 complex traits using GWAS summary statistics from nine data sources, including the Global Lipids Genetics Consortium, Breast Cancer Association Consortium, and FinnGen Biobank, with validation in the UK Biobank. Our results show that, on average, ALL-Sum obtained PRS with 25% higher accuracy, 15 times faster computation, and half the memory usage of the current state-of-the-art methods, and had robust performance across a wide range of traits and diseases. Furthermore, our method demonstrates stable prediction when using linkage disequilibrium computed from different data sources. ALL-Sum is available as a user-friendly R software package with publicly available reference data for streamlined analysis.
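To make the two ingredients named above concrete (an L0L2 penalized regression fitted from summary statistics, and an ensemble over tuning parameters), the following is a minimal Python sketch. It is not the ALL-Sum implementation and does not use the ALL-Sum R package. It assumes a commonly used summary-statistics objective, b'Rb - 2 b'r + lam0 * ||b||_0 + lam2 * ||b||_2^2, where r holds marginal SNP-trait correlations and R is a linkage disequilibrium (LD) correlation matrix. The function names, the cyclic coordinate descent solver, and the least-squares ensemble step are all illustrative assumptions.

```python
import numpy as np


def l0l2_coordinate_descent(r, R, lam0, lam2, n_iter=100, tol=1e-8):
    """Cyclic coordinate descent for an assumed L0L2 summary-statistics objective:

        b'Rb - 2 b'r + lam0 * ||b||_0 + lam2 * ||b||_2^2

    r : marginal SNP-trait correlations from GWAS, shape (p,)
    R : LD correlation matrix, shape (p, p)
    """
    p = r.shape[0]
    beta = np.zeros(p)
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Correlation left for SNP j after removing all other SNPs' effects.
            rho = r[j] - R[j] @ beta + R[j, j] * beta[j]
            denom = R[j, j] + lam2
            # Keep SNP j only if the gain rho^2 / denom exceeds the L0 cost lam0.
            new = rho / denom if rho * rho > lam0 * denom else 0.0
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta


def fit_grid(r, R, lam0_values, lam2_values):
    """Fit one coefficient vector per (lam0, lam2) pair; returns shape (K, p)."""
    return np.array([
        l0l2_coordinate_descent(r, R, lam0, lam2)
        for lam0 in lam0_values
        for lam2 in lam2_values
    ])


def ensemble_weights(beta_grid, G_tune, y_tune):
    """Assumed ensemble step: regress a tuning-set phenotype on the K grid PRS.

    Returns an intercept followed by one weight per grid point.
    """
    scores = G_tune @ beta_grid.T  # (n, K) matrix of candidate PRS
    X = np.column_stack([np.ones(len(y_tune)), scores])
    weights, *_ = np.linalg.lstsq(X, y_tune, rcond=None)
    return weights


if __name__ == "__main__":
    # Synthetic toy data; every value here is made up for illustration.
    rng = np.random.default_rng(0)
    n, p = 500, 20
    G = rng.standard_normal((n, p))
    G = (G - G.mean(axis=0)) / G.std(axis=0)
    true_beta = np.zeros(p)
    true_beta[:3] = [0.4, -0.3, 0.2]
    y = G @ true_beta + rng.standard_normal(n)
    y = (y - y.mean()) / y.std()

    r = G.T @ y / n  # stand-in for GWAS summary statistics
    R = G.T @ G / n  # stand-in for a reference LD matrix
    grid = fit_grid(r, R, lam0_values=[0.001, 0.01, 0.05], lam2_values=[0.0, 0.5])
    w = ensemble_weights(grid, G, y)  # a real analysis would use a separate tuning set
    print(grid.shape, w.shape)
```

In this sketch, larger lam0 values zero out more SNPs (suiting sparse architectures), while lam2 shrinks the retained effects; combining fits across the grid is one plausible way to adapt to traits whose polygenicity is unknown in advance.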