Toward whole-genome inference of polygenic scores with fast and memory-efficient algorithms.

Author information

Zabad Shadi, Haryan Chirayu Anant, Gravel Simon, Misra Sanchit, Li Yue

Affiliations

School of Computer Science, McGill University, Montreal, QC, Canada.

Parallel Computing Lab, Intel Labs, Bangalore, Karnataka, India.

Publication information

Am J Hum Genet. 2025 May 20. doi: 10.1016/j.ajhg.2025.05.002.

DOI:10.1016/j.ajhg.2025.05.002

PMID:40425013

Abstract

With improved whole-genome sequencing and variant imputation techniques, modern genome-wide association studies (GWASs) have enriched our understanding of the landscape of genetic associations for thousands of disease phenotypes. However, translating the marginal associations for millions of genetic variants to integrated polygenic risk scores (PRSs) that capture their joint effects on the phenotype remains a major challenge. Due to technical and statistical constraints, commonly used PRS methods in this setting either perform heuristic pruning and thresholding or overlook most genetic association signals by restricting inference to small variant sets, such as HapMap3. Here, we present a set of algorithmic improvements and compact data structures that enable scaling summary-statistics-based PRS inference to tens of millions of variants while avoiding numerical instabilities common in such high-dimensional settings. These enhancements consist of a highly compressed linkage-disequilibrium (LD) matrix format, which integrates with streamlined and parallel coordinate-ascent updating schemes. When incorporated into our existing PRS method (VIPRS), the proposed algorithms yield over 50-fold reductions in storage requirements and lead to orders-of-magnitude improvements in runtime and memory efficiency. The updated VIPRS software can now perform variational Bayesian regression over 1.1 million HapMap3 variants in under a minute. Using this scalable implementation, we applied VIPRS to 75 of the most heritable, continuous phenotypes in the UK Biobank, leveraging marginal associations for up to 18 million bi-allelic variants. These experiments demonstrated that VIPRS is 1-2 orders of magnitude more efficient than popular baselines while being competitive with the best-performing methods in terms of prediction accuracy.
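The abstract does not spell out the compressed LD format, so the following is only an illustrative sketch of one way such a format could work, not the VIPRS implementation. It assumes (hypothetically) that correlations are quantized to 8-bit integers and that only the upper-triangular entries within a fixed window of each variant are stored. The names `quantize`, `compress_ld`, `ld_matvec`, the scale factor, and the toy LD matrix are all made up for illustration. The coordinate-ascent updates themselves are not shown.

```python
import numpy as np

SCALE = 127  # largest int8 value; correlations lie in [-1, 1]

def quantize(r):
    """Map correlations in [-1, 1] to int8 codes in [-127, 127]."""
    return np.round(np.asarray(r, dtype=np.float64) * SCALE).astype(np.int8)

def compress_ld(R, bandwidth):
    """Keep, for each row i, only R[i, i+1 : i+1+bandwidth] as int8.

    The diagonal (always 1) and the lower triangle (mirror of the upper)
    are implied, so neither is stored. Returns (data, indptr), where row i
    occupies data[indptr[i]:indptr[i + 1]].
    """
    n = R.shape[0]
    rows = [quantize(R[i, i + 1 : i + 1 + bandwidth]) for i in range(n)]
    indptr = np.zeros(n + 1, dtype=np.int64)
    indptr[1:] = np.cumsum([row.size for row in rows])
    return np.concatenate(rows), indptr

def ld_matvec(data, indptr, x):
    """Compute R @ x from the compressed form, restoring symmetry on the fly."""
    x = np.asarray(x, dtype=np.float64)
    out = x.copy()  # contribution of the unit diagonal
    for i in range(x.size):
        seg = data[indptr[i]:indptr[i + 1]].astype(np.float64) / SCALE
        j = slice(i + 1, i + 1 + seg.size)
        out[i] += seg @ x[j]   # upper-triangle entries of row i
        out[j] += seg * x[i]   # mirrored lower-triangle entries
    return out

# Toy LD matrix (hypothetical): correlation decays as 0.9 ** |i - j|.
n, bandwidth = 200, 20
idx = np.arange(n)
R = 0.9 ** np.abs(idx[:, None] - idx[None, :])
data, indptr = compress_ld(R, bandwidth)

dense_bytes = R.nbytes
packed_bytes = data.nbytes + indptr.nbytes
print(f"dense: {dense_bytes} B, packed: {packed_bytes} B")
```

In this toy setting most of the saving comes from dropping entries outside the window and the redundant lower triangle, with the rest coming from the narrower integer type. The trade-off is a bounded rounding error of at most 0.5/127 per stored correlation. Any saving seen here is a property of the toy example and says nothing about the figures reported in the abstract.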
