Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA.
Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA.
Genome Biol. 2024 Oct 8;25(1):260. doi: 10.1186/s13059-024-03400-w.
Polygenic risk scores (PRS) are a major research topic in human genetics. However, a significant gap remains between PRS methodology and practical applications, because individual-level data are often unavailable for PRS tasks such as model fine-tuning, benchmarking, and ensemble learning.
We introduce an innovative statistical framework to optimize and benchmark PRS models using summary statistics of genome-wide association studies. This framework builds upon our previous work and can fine-tune virtually all existing PRS models while accounting for linkage disequilibrium. In addition, we provide an ensemble learning strategy named PUMAS-ensemble to combine multiple PRS models into an ensemble score without requiring external data for model fitting. Through extensive simulations and analysis of many complex traits in the UK Biobank, we demonstrate that this approach closely approximates gold-standard analytical strategies based on external validation, and substantially outperforms state-of-the-art PRS methods.
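As a rough illustration of why a linear ensemble of PRS models can in principle be fitted without individual-level data, the sketch below shows that least-squares ensemble weights depend on the data only through second moments (the covariance among the scores and their covariance with the phenotype). This is not the PUMAS-ensemble algorithm, whose details are not given in this abstract. The simulated data, the three hypothetical PRS models, and the function names `ensemble_weights` and `simulate` are all assumptions made for illustration.

```python
import numpy as np


def ensemble_weights(prs_cov, prs_pheno_cov):
    """Least-squares ensemble weights computed from second moments only.

    prs_cov:       (k, k) covariance matrix among k PRS models
    prs_pheno_cov: (k,)   covariance of each PRS with the phenotype
    """
    return np.linalg.solve(prs_cov, prs_pheno_cov)


def simulate(n=5000, seed=0):
    """Simulate three hypothetical PRS models as noisy views of one genetic signal."""
    rng = np.random.default_rng(seed)
    signal = rng.normal(size=n)
    noise_sd = np.array([0.5, 1.0, 2.0])  # assumed noise level of each PRS model
    prs = signal[:, None] + rng.normal(size=(n, 3)) * noise_sd
    pheno = signal + rng.normal(size=n)
    return prs, pheno


prs, pheno = simulate()
n = len(pheno)

# Center scores and phenotype so that no intercept term is needed.
prs_c = prs - prs.mean(axis=0)
pheno_c = pheno - pheno.mean()

# Second moments: the only quantities the weights depend on.
prs_cov = prs_c.T @ prs_c / n
prs_pheno_cov = prs_c.T @ pheno_c / n
w = ensemble_weights(prs_cov, prs_pheno_cov)

# Reference: ordinary least squares on the individual-level data.
w_ols, *_ = np.linalg.lstsq(prs_c, pheno_c, rcond=None)

# In-sample predictive R^2 of the ensemble and of each single PRS.
ensemble = prs_c @ w
r2_ensemble = np.corrcoef(ensemble, pheno_c)[0, 1] ** 2
r2_single = [np.corrcoef(prs_c[:, j], pheno_c)[0, 1] ** 2 for j in range(3)]

print("weights from moments:", w)
print("weights from OLS:    ", w_ols)
print("R2 ensemble:", r2_ensemble, "R2 single:", r2_single)
```

The two weight vectors agree up to floating-point error, and the in-sample R² of the ensemble is at least that of any single score. A summary-statistics method would have to estimate these moments without access to the individual-level matrix; how PUMAS-ensemble does so is beyond what the abstract states.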
Our method is a powerful and general modeling technique that can continue to combine the best-performing PRS methods through ensemble learning, and it could become an integral component of all future PRS applications.