Blanc Jennifer, Berg Jeremy J
Department of Human Genetics, University of Chicago, Chicago, IL, USA.
bioRxiv. 2024 Jun 26:2023.03.12.532301. doi: 10.1101/2023.03.12.532301.
Polygenic scores have become an important tool in human genetics, enabling the prediction of individuals' phenotypes from their genotypes. Understanding how the pattern of differences in polygenic score predictions across individuals intersects with variation in ancestry can provide insights into the evolutionary forces acting on the trait in question, and is important for understanding health disparities. However, because most polygenic scores are computed using effect estimates from population samples, they are susceptible to confounding by both genetic and environmental effects that are correlated with ancestry. The extent to which this confounding drives patterns in the distribution of polygenic scores depends on patterns of population structure in both the original estimation panel and in the prediction/test panel. Here, we use theory from population and statistical genetics, together with simulations, to study the procedure of testing for an association between polygenic scores and axes of ancestry variation in the presence of confounding. We use a general model of genetic relatedness to describe how confounding in the estimation panel biases the distribution of polygenic scores in a way that depends on the degree of overlap in population structure between panels. We then show how this confounding can bias tests for associations between polygenic scores and important axes of ancestry variation in the test panel. Specifically, for any given test, there exists a single axis of population structure in the GWAS panel that needs to be controlled for in order to protect the test. Based on this result, we propose a new approach for directly estimating this axis of population structure in the GWAS panel. We then use simulations to compare the performance of this approach to the standard approach in which the principal components of the GWAS panel genotypes are used to control for stratification.
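The confounding mechanism described above can be illustrated with a toy simulation. This is a minimal sketch, not the authors' implementation. It assumes two subpopulations whose allele-frequency differences are shared between the GWAS and test panels, a phenotype with no genetic effects at all, and a purely environmental shift between the GWAS subpopulations. The function names, sample sizes, and parameter values are hypothetical. The adjustment axis is formed by projecting the test-panel ancestry contrast into the GWAS panel through the genotypes; the abstract does not specify this construction, so treat it as one plausible choice.

```python
import numpy as np

def simulate_panel(freq_a, freq_b, n_per_pop, rng):
    """Draw diploid genotypes for two equally sized subpopulations A and B."""
    g_a = rng.binomial(2, freq_a, size=(n_per_pop, freq_a.size))
    g_b = rng.binomial(2, freq_b, size=(n_per_pop, freq_b.size))
    labels = np.repeat([-0.5, 0.5], n_per_pop)  # centered indicator, B = +0.5
    return np.vstack([g_a, g_b]).astype(float), labels

def marginal_gwas(G, y, covar=None):
    """Per-SNP OLS effect estimates, optionally adjusting for one covariate."""
    n = G.shape[0]
    C = np.ones((n, 1)) if covar is None else np.column_stack([np.ones(n), covar])
    # Residualize genotypes and phenotype on the covariates, then regress.
    G_res = G - C @ np.linalg.lstsq(C, G, rcond=None)[0]
    y_res = y - C @ np.linalg.lstsq(C, y, rcond=None)[0]
    return (G_res.T @ y_res) / np.sum(G_res**2, axis=0)

rng = np.random.default_rng(0)
n_snps = 1000
p = rng.uniform(0.2, 0.8, n_snps)   # shared baseline allele frequencies
d = rng.normal(0.0, 0.05, n_snps)   # per-SNP divergence between A and B
freq_a = np.clip(p + d, 0.05, 0.95)
freq_b = np.clip(p - d, 0.05, 0.95)

# Both panels share the same population structure (full overlap).
G_gwas, pop_gwas = simulate_panel(freq_a, freq_b, 500, rng)
G_test, T = simulate_panel(freq_a, freq_b, 100, rng)

# Phenotype with NO genetic effects: noise plus an environmental shift in B.
y = rng.normal(size=G_gwas.shape[0]) + 1.0 * pop_gwas

# Assumed construction: project the test contrast T into the GWAS panel.
X_c = G_test - G_test.mean(axis=0)
r = X_c.T @ T / T.size                               # per-SNP test contrast
axis_gwas = (G_gwas - G_gwas.mean(axis=0)) @ r       # one value per GWAS sample

def pgs_contrast(beta):
    """Mean polygenic score difference (B minus A) in the test panel."""
    pgs = X_c @ beta
    return pgs[T > 0].mean() - pgs[T < 0].mean()

naive = pgs_contrast(marginal_gwas(G_gwas, y))
adjusted = pgs_contrast(marginal_gwas(G_gwas, y, covar=axis_gwas))
print(f"unadjusted PGS contrast: {naive:.2f}")
print(f"adjusted PGS contrast:   {adjusted:.2f}")
```

Because the simulated trait has no genetic basis, any sizeable difference in mean polygenic score between the test subpopulations reflects stratification carried over from the GWAS panel. In this setup the unadjusted contrast should be large, while including the single projected axis as a covariate should shrink it toward zero.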