Almost Free Enhancement of Multi-Population PRS: From Data-Fission to Pseudo-GWAS Subsampling.

Authors

Xu Leqi, Dong Yikai, Zeng Xiaowei, Bian Zeyu, Zhou Geyu, Guan Leying, Zhao Hongyu

Affiliations

Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA.

Department of Statistics and Data Science, Fudan University, Shanghai, China.

Publication information

bioRxiv. 2025 Jun 20:2025.06.16.659952. doi: 10.1101/2025.06.16.659952.

DOI:10.1101/2025.06.16.659952

PMID:40611909

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12224544/

Abstract

Many multi-population polygenic risk score (PRS) methods have been proposed to improve prediction accuracy in underrepresented populations; however, no single method outperforms other methods across all data scenarios. Although integrating PRS results across multiple methods and populations may lead to more accurate predictions, this approach may be limited by the availability of individual-level tuning data to calculate combination weights. In this manuscript, we introduce MIXPRS, a robust PRS integration framework based on data fission principles, to effectively combine multiple multi-population PRS methods using only genome-wide association study (GWAS) summary statistics from multiple populations. Specifically, MIXPRS employs SNP pruning to mitigate linkage disequilibrium (LD) mismatch between the training GWAS summary statistics and LD reference panels, and utilizes non-negative least squares regression to robustly estimate PRS combination weights. Extensive simulations and real-data analyses involving 22 continuous traits and four binary traits across five populations from the UK Biobank and All of Us datasets demonstrate that MIXPRS consistently outperforms the existing methods in prediction accuracy. Because MIXPRS relies solely on GWAS summary statistics, it enjoys broad accessibility, robustness, and generalizability for underrepresented populations.
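The weight-estimation step mentioned in the abstract can be illustrated with a minimal sketch of non-negative least squares. This is not the MIXPRS implementation: the abstract states that MIXPRS works from GWAS summary statistics via data fission, whereas this toy example uses small individual-level vectors to keep the arithmetic visible. The function name `nnls_weights`, the variables `prs_a` and `prs_b`, and the projected-gradient solver (a stand-in for a library NNLS routine) are all assumptions made for illustration.

```python
def nnls_weights(prs_columns, phenotype, n_iter=2000):
    """Hypothetical helper: minimise 0.5 * ||X w - y||^2 subject to w >= 0.

    prs_columns: list of PRS vectors, one per candidate method (columns of X).
    phenotype:   outcome vector y of the same length as each PRS vector.
    Solved by projected gradient descent, not by the authors' code.
    """
    k = len(prs_columns)
    # Gram matrix X^T X and cross-product X^T y.
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in prs_columns]
            for ci in prs_columns]
    xty = [sum(a * b for a, b in zip(col, phenotype)) for col in prs_columns]
    # The trace of a positive semi-definite matrix bounds its largest
    # eigenvalue, so 1 / trace is a safe (if conservative) step size.
    trace = sum(gram[i][i] for i in range(k))
    if trace == 0:
        return [0.0] * k
    step = 1.0 / trace
    w = [0.0] * k
    for _ in range(n_iter):
        grad = [sum(gram[i][j] * w[j] for j in range(k)) - xty[i]
                for i in range(k)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, w[i] - step * grad[i]) for i in range(k)]
    return w


# Toy data (assumed, not from the paper): two PRS vectors and an outcome
# constructed as 0.7 * prs_a + 0.3 * prs_b.
prs_a = [1.0, 0.0, 1.0, 0.0]
prs_b = [0.0, 1.0, 1.0, 0.0]
y = [0.7, 0.3, 1.0, 0.0]
weights = nnls_weights([prs_a, prs_b], y)
print(weights)
```

On this toy input the estimated weights come out close to 0.7 and 0.3. The non-negativity constraint matters when a candidate PRS is negatively associated with the outcome: its weight is clamped to zero rather than being allowed to flip sign, which is one reason such a constraint can make a combination more robust.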
