Grinde Kelsey E, Browning Brian L, Reiner Alexander P, Thornton Timothy A, Browning Sharon R
Department of Mathematics, Statistics, and Computer Science, Macalester College, Saint Paul, Minnesota, United States of America.
Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, Washington, United States of America.
PLoS Genet. 2024 Dec 16;20(12):e1011242. doi: 10.1371/journal.pgen.1011242. eCollection 2024 Dec.
Principal component analysis (PCA) is widely used to control for population structure in genome-wide association studies (GWAS). Top principal components (PCs) typically reflect population structure, but challenges arise in deciding how many PCs are needed and ensuring that PCs do not capture other artifacts such as regions with atypical linkage disequilibrium (LD). In response to the latter, many groups suggest performing LD pruning or excluding known high LD regions prior to PCA. However, these suggestions are not universally implemented and the implications for GWAS are not fully understood, especially in the context of admixed populations. In this paper, we investigate the impact of pre-processing and the number of PCs included in GWAS models in African American samples from the Women's Health Initiative SNP Health Association Resource and two Trans-Omics for Precision Medicine Whole Genome Sequencing Project contributing studies (Jackson Heart Study and Genetic Epidemiology of Chronic Obstructive Pulmonary Disease Study). In all three samples, we find the first PC is highly correlated with genome-wide ancestry whereas later PCs often capture local genomic features. The pattern of which, and how many, genetic variants are highly correlated with individual PCs differs from what has been observed in prior studies focused on European populations and leads to distinct downstream consequences: adjusting for such PCs yields biased effect size estimates and elevated rates of spurious associations due to the phenomenon of collider bias. Excluding high LD regions identified in previous studies does not resolve these issues. LD pruning proves more effective, but the optimal choice of thresholds varies across datasets. 
Altogether, our work highlights unique issues that arise when using PCA to control for ancestral heterogeneity in admixed populations and demonstrates the importance of careful pre-processing and diagnostics to ensure that PCs capturing multiple local genomic features are not included in GWAS models.
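The collider-bias mechanism described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' analysis: the names `variant_a`, `variant_b`, `pc`, and `fit_ols` are hypothetical, and the allele frequency (0.3), effect size (0.5), and the way the PC is built are arbitrary illustrative choices. Two independent variants both load on the same PC, and only `variant_a` affects the trait. Adjusting for that PC should induce a spurious association between `variant_b` and the trait.

```python
import numpy as np


def fit_ols(y, *covariates):
    """Least-squares fit of y on an intercept plus the given covariates."""
    X = np.column_stack([np.ones(len(y)), *covariates])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # coef[0] is the intercept, coef[1] the first covariate


rng = np.random.default_rng(0)
n = 20_000

# Two independent variants (genotypes coded 0/1/2); assumed frequency 0.3.
variant_a = rng.binomial(2, 0.3, size=n).astype(float)
variant_b = rng.binomial(2, 0.3, size=n).astype(float)

# A hypothetical PC that captures both local features plus noise,
# making it a collider of variant_a and variant_b.
pc = variant_a + variant_b + rng.normal(size=n)

# Only variant_a affects the trait; variant_b has no true effect.
trait = 0.5 * variant_a + rng.normal(size=n)

# Effect estimate for variant_b without and with PC adjustment.
beta_unadjusted = fit_ols(trait, variant_b)[1]
beta_adjusted = fit_ols(trait, variant_b, pc)[1]

print(f"variant_b effect, no PC adjustment: {beta_unadjusted:.3f}")
print(f"variant_b effect, adjusted for PC:  {beta_adjusted:.3f}")
```

In this toy setup the unadjusted estimate for `variant_b` stays near zero, while the PC-adjusted estimate is pulled away from zero, because conditioning on the PC makes the two variants dependent. Real PCs are computed from genome-wide genotypes, so the size of the bias in practice depends on how strongly individual variants load on the adjusted PCs.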