Dias Julie-Alexia, Chen Tony, Xing Hua, Wang Xiaoyu, Rodriguez Alex A, Madduri Ravi K, Kraft Peter, Zhang Haoyu
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
Division of Cancer Epidemiology & Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA; Cancer Genomics Research Laboratory, Frederick National Laboratory for Cancer Research, Leidos Biomedical Research Inc., Rockville, MD, USA.
Am J Hum Genet. 2025 Aug 28. doi: 10.1016/j.ajhg.2025.08.006.
The increasing availability of diverse biobanks has enabled multi-ancestry genome-wide association studies (GWASs) to enhance the discovery of genetic variants across traits and diseases. However, the optimal analysis method remains debated, because statistical power differs across ancestry groups and approaches to accounting for population structure vary. Two primary strategies exist: (1) pooled analysis, which combines individuals from all genetic backgrounds into a single dataset and adjusts for population stratification using principal components, increasing sample size and statistical power but requiring careful control of stratification; and (2) meta-analysis, which performs ancestry-group-specific GWASs and then combines their summary statistics, potentially capturing fine-scale population structure but facing limitations in handling admixed individuals. We compare these methods using large-scale simulations with varying sample sizes and ancestry compositions, together with real-data analyses of eight continuous and five binary traits from the UK Biobank (N ≈ 324,000) and the All of Us Research Program (N ≈ 207,000). Our results demonstrate that pooled analysis generally exhibits better statistical power while effectively adjusting for population stratification. We further present a theoretical framework linking power differences to allele-frequency variation across populations. These findings, validated across both biobanks, highlight pooled analysis as a powerful and scalable strategy for multi-ancestry GWASs, improving genetic discovery while maintaining rigorous control of population structure.
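The two strategies described above can be illustrated with a small toy simulation. The sketch below is not the authors' pipeline: the function names, sample sizes, allele frequencies, effect size, and phenotype shift are all hypothetical, and a known ancestry-group indicator stands in for the principal components that a real pooled analysis would use. It tests a single variant in two groups three ways: a pooled regression without adjustment, a pooled regression adjusted for the group indicator, and a fixed-effect inverse-variance-weighted meta-analysis of group-specific estimates. In this simplified setting the adjusted pooled and meta-analysis estimates are expected to be nearly identical, so the sketch shows only how each estimator is computed and why stratification control matters. It does not reproduce the power comparison reported in the abstract.

```python
import numpy as np


def ols_genotype_effect(g, y, covars=None):
    """Fit y ~ 1 + g (+ covars) by OLS; return (beta, se) for the genotype term."""
    n = len(y)
    cols = [np.ones(n), g]
    if covars is not None:
        cols.append(covars)
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1], np.sqrt(cov[1, 1])


def compare_pooled_vs_meta(seed=0):
    """Simulate one variant in two ancestry groups and test it three ways.

    All numeric settings here are hypothetical choices for illustration.
    """
    rng = np.random.default_rng(seed)
    n = [20000, 5000]      # group sample sizes
    freq = [0.30, 0.10]    # effect-allele frequency per group
    shift = [0.0, 1.0]     # group-level phenotype mean (source of stratification)
    beta_true = 0.05       # effect shared by both groups

    g_parts, y_parts, a_parts, per_group = [], [], [], []
    for k in range(2):
        g = rng.binomial(2, freq[k], size=n[k]).astype(float)
        y = shift[k] + beta_true * g + rng.normal(size=n[k])
        g_parts.append(g)
        y_parts.append(y)
        a_parts.append(np.full(n[k], float(k)))
        # Ancestry-group-specific association test (input to the meta-analysis).
        per_group.append(ols_genotype_effect(g, y))

    g_all = np.concatenate(g_parts)
    y_all = np.concatenate(y_parts)
    anc = np.concatenate(a_parts)

    # Pooled analysis without any adjustment: confounded by the group shift.
    naive = ols_genotype_effect(g_all, y_all)
    # Pooled analysis adjusted for ancestry (indicator used in place of PCs).
    pooled = ols_genotype_effect(g_all, y_all, covars=anc)

    # Fixed-effect inverse-variance-weighted meta-analysis.
    betas = np.array([b for b, _ in per_group])
    weights = np.array([1.0 / se**2 for _, se in per_group])
    beta_meta = np.sum(weights * betas) / np.sum(weights)
    se_meta = np.sqrt(1.0 / np.sum(weights))

    return {"naive": naive, "pooled": pooled, "meta": (beta_meta, se_meta)}


if __name__ == "__main__":
    for name, (beta, se) in compare_pooled_vs_meta().items():
        print(f"{name:>6}: beta={beta:+.4f}  se={se:.4f}  z={beta / se:+.2f}")
```

Because allele frequency and mean phenotype both differ between the two simulated groups, the unadjusted pooled estimate is biased, whereas the adjusted pooled and meta-analysis estimates both target the shared effect.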