Zabad Shadi, Haryan Chirayu Anant, Gravel Simon, Misra Sanchit, Li Yue
School of Computer Science, McGill University, Montreal, QC, Canada.
Parallel Computing Lab, Intel Labs, Bangalore, Karnataka, India.
Am J Hum Genet. 2025 May 20. doi: 10.1016/j.ajhg.2025.05.002.
With improved whole-genome sequencing and variant imputation techniques, modern genome-wide association studies (GWASs) have enriched our understanding of the landscape of genetic associations for thousands of disease phenotypes. However, translating the marginal associations for millions of genetic variants into integrated polygenic risk scores (PRSs) that capture their joint effects on the phenotype remains a major challenge. Due to technical and statistical constraints, commonly used PRS methods in this setting either perform heuristic pruning and thresholding or overlook most genetic association signals by restricting inference to small variant sets, such as HapMap3. Here, we present a set of algorithmic improvements and compact data structures that enable scaling summary-statistics-based PRS inference to tens of millions of variants while avoiding the numerical instabilities common in such high-dimensional settings. These enhancements consist of a highly compressed linkage-disequilibrium (LD) matrix format, which integrates with streamlined and parallel coordinate-ascent updating schemes. When incorporated into our existing PRS method (VIPRS), the proposed algorithms yield over 50-fold reductions in storage requirements and lead to orders-of-magnitude improvements in runtime and memory efficiency. The updated VIPRS software can now perform variational Bayesian regression over 1.1 million HapMap3 variants in under a minute. Using this scalable implementation, we applied VIPRS to 75 of the most heritable continuous phenotypes in the UK Biobank, leveraging marginal associations for up to 18 million bi-allelic variants. These experiments demonstrated that VIPRS is one to two orders of magnitude more efficient than popular baselines while being competitive with the best-performing methods in terms of prediction accuracy.
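To make the idea of a compressed LD matrix concrete, here is a minimal sketch of one generic way such storage savings can arise. It is an illustration under stated assumptions, not the VIPRS format: the abstract does not describe the encoding, so the int8 quantization, the fixed `window` band, and the helper names `quantize_ld`, `dequantize_ld`, and `banded_upper` are all hypothetical. The data are simulated.

```python
import numpy as np

def quantize_ld(r, dtype=np.int8):
    """Map correlations in [-1, 1] to signed integer codes (assumed scheme)."""
    scale = np.iinfo(dtype).max  # 127 for int8
    return np.round(np.clip(r, -1.0, 1.0) * scale).astype(dtype)

def dequantize_ld(q):
    """Recover approximate correlations from the integer codes."""
    return q.astype(np.float64) / np.iinfo(q.dtype).max

def banded_upper(ld, window):
    """Keep, for each variant i, only entries (i, i+1 .. i+window).

    Returns the concatenated values plus a CSR-style row pointer array.
    The diagonal (always 1) and the symmetric lower triangle are dropped.
    """
    n = ld.shape[0]
    data, indptr = [], [0]
    for i in range(n):
        row = ld[i, i + 1 : min(n, i + 1 + window)]
        data.append(row)
        indptr.append(indptr[-1] + row.size)
    return np.concatenate(data), np.array(indptr)

# Simulated stand-in for a genotype matrix (500 samples, 200 variants).
rng = np.random.default_rng(0)
n, window = 200, 20
genotypes = rng.standard_normal((500, n))
ld = np.corrcoef(genotypes, rowvar=False)

values, indptr = banded_upper(ld, window)
codes = quantize_ld(values)

max_err = np.max(np.abs(dequantize_ld(codes) - values))
dense_bytes = ld.astype(np.float64).nbytes
compact_bytes = codes.nbytes + indptr.nbytes
print(f"max abs error: {max_err:.4f}")
print(f"dense: {dense_bytes} bytes, compact: {compact_bytes} bytes")
```

In this toy setting the savings come from three sources: dropping the symmetric half and the diagonal, discarding pairs outside the band, and using one byte per entry instead of eight. The quantization error is bounded by half a code step, which is about 0.004 for int8.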