Cataldo-Ramirez Chelsea C, Lin Meng, McMahon Aislinn, Gignoux Christopher R, Weaver Timothy D, Henn Brenna M
Department of Anthropology, University of California Davis, Davis, CA, 95616, USA.
Department of Population and Public Health Sciences, Center for Genetic Epidemiology, Keck School of Medicine, University of Southern California, CA 91001, USA.
bioRxiv. 2024 Oct 29:2024.10.28.620716. doi: 10.1101/2024.10.28.620716.
Genome-wide association studies (GWAS) and polygenic score (PGS) development are typically constrained by the data available in biobank repositories, in which European cohorts are vastly overrepresented. Here, we increase the utility of non-European participant data within the UK Biobank (UKB) by characterizing the genetic affinities of UKB participants who self-identify as Bangladeshi, Indian, Pakistani, "White and Asian" (WA), and "Any Other Asian" (AOA), with the aim of creating a larger, more robust South Asian sample for future genetic analyses. We assess the relationships between genetic structure and self-selected ethnic identities and find consistent patterns of clustering, which we use to train a support vector machine (SVM). The SVM model was used to reassign n = 1,853 AOA and WA participants at the subcontinental level, increasing the sample size of the UKB South Asian group by 1,381 participants. We then leverage these samples to assess GWAS performance and PGS development. We further include environmental covariates in the height GWAS by implementing a rigorous covariate selection procedure, and compare the outputs of two GWAS models: one without and one with environmental covariates. We show that a PGS derived from the environmentally adjusted GWAS yields prediction comparable to PGS models developed with an order-of-magnitude larger training dataset (R² = 0.021 vs 0.026). Models with 7-8 environmental covariates double the variance explained by the PGS alone. In summary, we demonstrate how GWAS performance can be improved by leveraging ambiguous ethnicity codes, using ancestry-matched imputation panels, and including environmental covariates.
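The comparison between the variance explained by a PGS alone and by a PGS plus environmental covariates can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic, the effect sizes are arbitrary, and the helper name `ols_r2` and the variables `pgs`, `env`, and `height` are hypothetical. It only shows how in-sample R² from two nested linear models might be compared.

```python
import random


def ols_r2(X, y):
    """In-sample R^2 of an ordinary least squares fit of y on X (intercept added)."""
    n = len(y)
    rows = [[1.0] + list(r) for r in X]  # prepend intercept column
    p = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y)
    M = [
        [sum(rows[k][i] * rows[k][j] for k in range(n)) for j in range(p)]
        + [sum(rows[k][i] * y[k] for k in range(n))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    beta = [M[i][p] / M[i][i] for i in range(p)]
    pred = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic data (assumption): a standardized PGS and three environmental covariates.
random.seed(0)
n = 500
pgs = [random.gauss(0, 1) for _ in range(n)]
env = [[random.gauss(0, 1) for _ in range(3)] for _ in range(n)]
height = [
    0.3 * g + 0.4 * e[0] + 0.2 * e[1] + random.gauss(0, 1)
    for g, e in zip(pgs, env)
]

r2_pgs = ols_r2([[g] for g in pgs], height)                     # PGS only
r2_full = ols_r2([[g] + e for g, e in zip(pgs, env)], height)   # PGS + covariates

print(f"R2 (PGS only):         {r2_pgs:.3f}")
print(f"R2 (PGS + covariates): {r2_full:.3f}")
```

Because the two models are nested, the in-sample R² of the fuller model cannot be lower than that of the PGS-only model; an evaluation on real data would normally use a held-out sample and a covariate selection step, which this sketch omits.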