Dias Julie-Alexia, Chen Tony, Xing Hua, Wang Xiaoyu, Rodriguez Alex A, Madduri Ravi K, Kraft Peter, Zhang Haoyu
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
Division of Cancer Epidemiology & Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
medRxiv. 2025 Mar 12:2025.03.11.25323772. doi: 10.1101/2025.03.11.25323772.
The increasing availability of diverse biobanks has enabled multi-ancestry genome-wide association studies (GWAS), enhancing the discovery of genetic variants across traits and diseases. However, the optimal analysis method remains debated, because statistical power differs across ancestry groups and approaches differ in how they account for population structure. Two primary strategies exist. (1) Pooled analysis combines individuals from all genetic backgrounds into a single dataset and adjusts for population stratification using principal components; this increases sample size and statistical power but requires careful control of population stratification. (2) Meta-analysis performs ancestry-group-specific GWAS and then combines the summary statistics; this can capture fine-scale population structure but is limited in its handling of admixed individuals. Using large-scale simulations with varying sample sizes and ancestry compositions, we compare these methods alongside real-data analyses of eight continuous and five binary traits from the UK Biobank and the All of Us Research Program. Our results demonstrate that pooled analysis generally exhibits better statistical power while effectively adjusting for population stratification. We further present a theoretical framework linking power differences to differences in allele frequency across populations. These findings, validated across both biobanks, highlight pooled analysis as a robust and scalable strategy for multi-ancestry GWAS, improving genetic discovery while maintaining rigorous control of population structure.
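The two strategies described in the abstract can be illustrated for a single variant with a minimal sketch. It is not the authors' pipeline, and several things in it are assumptions made for illustration: the function names (`pooled_estimate`, `meta_estimate`, `simulate`), the simulation parameters, the use of a group indicator as a stand-in for principal components, and the choice of a fixed-effect inverse-variance-weighted combination for the meta-analysis step.

```python
import numpy as np

def ols_beta_se(X, y):
    """Return OLS coefficients and their standard errors for design matrix X."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return beta, se

def pooled_estimate(g, y, covars):
    """Pooled analysis: one regression on all individuals, adjusting for covariates.

    `covars` stands in for principal components (an assumption of this sketch).
    """
    X = np.column_stack([np.ones(len(y)), g, covars])
    beta, se = ols_beta_se(X, y)
    return float(beta[1]), float(se[1])

def meta_estimate(g, y, group):
    """Meta-analysis: per-group regressions combined by inverse-variance weights."""
    betas, ses = [], []
    for k in np.unique(group):
        mask = group == k
        X = np.column_stack([np.ones(mask.sum()), g[mask]])
        b, s = ols_beta_se(X, y[mask])
        betas.append(b[1])
        ses.append(s[1])
    w = 1.0 / np.square(ses)  # fixed-effect inverse-variance weights
    beta_meta = np.sum(w * np.array(betas)) / np.sum(w)
    se_meta = np.sqrt(1.0 / np.sum(w))
    return float(beta_meta), float(se_meta)

def simulate(n_per_group=(5000, 1000), freqs=(0.3, 0.1), beta=0.1, shift=0.5, seed=0):
    """Toy data: two groups with different allele frequencies and trait means."""
    rng = np.random.default_rng(seed)
    group = np.repeat(np.arange(len(n_per_group)), n_per_group)
    g = np.concatenate(
        [rng.binomial(2, f, n) for f, n in zip(freqs, n_per_group)]
    ).astype(float)
    # Shared genetic effect plus a group-level mean shift (the stratification).
    y = beta * g + shift * group + rng.normal(size=group.size)
    return g, y, group

if __name__ == "__main__":
    g, y, group = simulate()
    print("pooled:", pooled_estimate(g, y, group))
    print("meta:  ", meta_estimate(g, y, group))
```

In this toy setting, with discrete groups and a perfect group covariate, the two estimators behave almost identically. The differences the abstract refers to arise in more realistic settings (admixed individuals, continuous ancestry, many variants), which this sketch does not attempt to model.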