Department of Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089.
Genetics. 2019 Jan;211(1):289-304. doi: 10.1534/genetics.118.301747. Epub 2018 Nov 20.
Population structure leads to systematic patterns in measures of mean relatedness between individuals in large genomic data sets, which are often discovered and visualized using dimension reduction techniques such as principal component analysis (PCA). Mean relatedness is an average of the relationships across locus-specific genealogical trees, which can be strongly affected on intermediate genomic scales by linked selection and other factors. We show how to use local PCA to describe this intermediate-scale heterogeneity in patterns of relatedness, and apply the method to genomic data from three species, finding in each that the effect of population structure can vary substantially across only a few megabases. In a global human data set, localized heterogeneity is likely explained by polymorphic chromosomal inversions. In a range-wide data set of Medicago truncatula, factors that produce heterogeneity are shared between chromosomes, correlate with local gene density, and may be caused by linked selection, such as background selection or local adaptation. In a data set of primarily African Drosophila melanogaster, large-scale heterogeneity across each chromosome arm is explained by known chromosomal inversions thought to be under recent selection and, after removing samples carrying inversions, remaining heterogeneity is correlated with recombination rate and gene density, again suggesting a role for linked selection. The visualization method provides a flexible new way to discover biological drivers of genetic variation, and its application to data highlights the strong effects that linked selection and chromosomal inversions can have on observed patterns of genetic variation.
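The local PCA idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a genotype matrix of shape (SNPs, samples), fixed-size windows counted in SNPs, a Frobenius distance between normalized rank-k approximations of each window's sample covariance, and classical multidimensional scaling to lay the windows out in a few dimensions. All function names, the normalization, and the parameter defaults are hypothetical choices made for illustration.

```python
import numpy as np


def window_pca(genotypes, k=2):
    """Top-k eigenpairs of the sample-by-sample covariance for one window.

    genotypes: array of shape (n_snps, n_samples) for a single window.
    """
    # Center each SNP across samples (an assumed centering choice).
    centered = genotypes - genotypes.mean(axis=1, keepdims=True)
    cov = centered.T @ centered / max(genotypes.shape[0] - 1, 1)
    vals, vecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]
    return vals[order], vecs[:, order]


def low_rank(vals, vecs):
    """Rank-k approximation of the covariance, scaled to unit Frobenius norm."""
    approx = (vecs * vals) @ vecs.T
    norm = np.linalg.norm(approx)
    return approx / norm if norm > 0 else approx


def local_pca_distances(genotypes, window_size, k=2):
    """Pairwise distances between windows, based on their local PCA summaries."""
    n_windows = genotypes.shape[0] // window_size
    approxs = [
        low_rank(*window_pca(genotypes[i * window_size:(i + 1) * window_size], k))
        for i in range(n_windows)
    ]
    dist = np.zeros((n_windows, n_windows))
    for i in range(n_windows):
        for j in range(i + 1, n_windows):
            dist[i, j] = dist[j, i] = np.linalg.norm(approxs[i] - approxs[j])
    return dist


def classical_mds(dist, n_dims=2):
    """Classical MDS: embed windows so that similar windows sit close together."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    vals, vecs = np.linalg.eigh(gram)
    order = np.argsort(vals)[::-1][:n_dims]
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0, None))
```

Plotting the MDS coordinates against window position along a chromosome is one way to see where the pattern of relatedness shifts. Windows that share a structure-altering feature, such as an inversion, would be expected to cluster together under these assumptions.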