Shemirani Ruhollah, Belbin Gillian M, Cullina Sinead, Caggiano Christa, Gignoux Christopher, Zaitlen Noah, Kenny Eimear E
Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
medRxiv. 2025 Jun 5:2025.06.04.25328990. doi: 10.1101/2025.06.04.25328990.
Population structure is a well-known confounder in statistical genetics, particularly in genome-wide association studies (GWAS), where it can lead to inflated test statistics and spurious associations. Principal components (PCs), the method traditionally used to adjust for population structure, are limited in capturing the fine-scale, non-linear patterns that arise from recent demographic events, and these patterns are crucial for understanding rare variant effects. To address this challenge, we propose a novel method called SPectral Components (SPCs), which leverages identity-by-descent (IBD) graphs to capture local, non-linear, fine-scale population structure and transform it into continuous representations that can be integrated directly into genetic analysis pipelines. Using both simulated datasets and empirical data from the UK Biobank (N ≈ 420,000), we demonstrate that SPCs outperform PCs in adjusting for fine-scale population structure. In simulations, SPCs explained over 90% of the fine-scale population structure with fewer components, while PCs captured less than 50%. In the UK Biobank, SPCs reduced the inflation of p-values in the GWAS of an environmentally driven phenotype by 12% compared to PCs, while performing comparably to PCs for height, a highly heritable phenotype. Additionally, SPCs improved rare variant association analyses, reducing genomic inflation (e.g., from 7.6 to 1.2 in one analysis), and provided more accurate heritability estimates. Spatial autocorrelation analysis further confirmed the ability of SPCs to account for environmental effects: SPCs reduced Moran's I for both environmental and heritable phenotypes more effectively than PCs. Overall, our findings demonstrate that SPCs provide a robust, scalable adjustment for recent population structure, offering a powerful alternative or complement to PCs in large-scale biobank studies.
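The genomic inflation values quoted in the abstract are commonly summarized as the genomic inflation factor, the median observed association chi-square statistic divided by its expected median under the null (about 0.455 for 1 degree of freedom). The abstract does not state how the authors computed it, so the following is only a minimal sketch of that textbook definition, assuming two-sided p-values from 1-df tests; the function name `genomic_inflation` is hypothetical and is not taken from the paper or any SPC software.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 degree of freedom:
# P(Z^2 <= m) = 0.5  <=>  Phi(sqrt(m)) = 0.75, so m is about 0.4549.
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Return the genomic inflation factor for two-sided 1-df p-values.

    Assumption (not from the source): each p-value comes from a 1-df
    association test, so it maps back to a chi-square statistic z**2.
    """
    chi2_stats = []
    for p in p_values:
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
        # Lower-tail quantile of p/2 keeps precision for very small p.
        z = _STD_NORMAL.inv_cdf(p / 2.0)
        chi2_stats.append(z * z)
    return median(chi2_stats) / CHI2_1DF_MEDIAN


if __name__ == "__main__":
    n = 1001
    # Evenly spaced p-values mimic a well-calibrated (null) GWAS.
    calibrated = [(i + 0.5) / n for i in range(n)]
    # Squaring shifts p-values toward zero, mimicking inflation.
    inflated = [p * p for p in calibrated]
    print("calibrated:", round(genomic_inflation(calibrated), 3))
    print("inflated:  ", round(genomic_inflation(inflated), 3))
```

A value near 1 indicates well-calibrated test statistics, while values well above 1 suggest residual confounding such as unadjusted population structure. Comparing this quantity between a GWAS adjusted with PCs and one adjusted with SPCs is one way to read the reductions the abstract reports.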