Chen G-B, Lee S H, Zhu Z-X, Benyamin B, Robinson M R
Queensland Brain Institute, The University of Queensland, Brisbane, Queensland, Australia.
School of Environmental and Rural Science, The University of New England, Armidale, New South Wales, Australia.
Heredity (Edinb). 2016 Jul;117(1):51-61. doi: 10.1038/hdy.2016.25. Epub 2016 May 4.
We develop a novel approach to identify regions of the genome underlying population genetic differentiation in any genetic data where the underlying population structure is unknown, or where the interest is in assessing divergence along a gradient. Our approach, EigenGWAS, combines the statistical framework for genome-wide association studies (GWASs) with eigenvector decomposition, which is commonly used in population genetics to characterize the structure of genetic data, so that loci under selection can be identified without a requirement for discrete populations. We show through theory and simulation that our approach can identify regions under selection along gradients of ancestry. In real data we confirm this by demonstrating LCT to be under selection between the HapMap CEU and TSI cohorts, and we then validate this selection signal across European countries in the POPRES samples. HERC2 was also found to be differentiated both between the CEU and TSI cohorts and within the POPRES sample, reflecting the likely anthropological differences in skin and hair colour between northern and southern European populations. Controlling for population stratification is of great importance in any quantitative genetic study, and our approach also provides a simple, fast and accurate way of predicting principal components in independent samples. With ever-increasing sample sizes across many fields, this approach is likely to be widely used to obtain individual-level eigenvectors while avoiding the computational challenges of conducting singular value decomposition in large data sets. We have developed freely available software, Genetic Analysis Repository (GEAR), to facilitate the application of the methods.
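The core idea described above, treating an eigenvector of the genotype data as the phenotype in a GWAS-style single-marker scan, can be illustrated with a minimal NumPy sketch. This is not the GEAR implementation: the function names (`standardize`, `eigengwas`, `project`), the choice of a simple correlation-based 1-df test, and the projection scaling are all assumptions made for illustration, and any adjustment of the test statistics (for example for genetic drift) is left out.

```python
import math
import numpy as np


def standardize(genotypes, mean=None, sd=None):
    """Centre and scale each SNP column (0/1/2 allele counts).

    If mean/sd are supplied (e.g. from a discovery sample) they are reused.
    Monomorphic SNPs (sd == 0) are left as all-zero columns.
    """
    G = np.asarray(genotypes, dtype=float)
    if mean is None:
        mean = G.mean(axis=0)
    if sd is None:
        sd = G.std(axis=0)
    sd_safe = np.where(sd > 0, sd, 1.0)
    return (G - mean) / sd_safe, mean, sd


def eigengwas(genotypes, k=0):
    """Illustrative EigenGWAS-style scan (hypothetical helper, not GEAR).

    1. Build a genetic relationship matrix from standardized genotypes.
    2. Take its k-th leading eigenvector as the 'phenotype'.
    3. Test each SNP for association with that eigenvector.
    """
    Z, mean, sd = standardize(genotypes)
    n, m = Z.shape
    grm = Z @ Z.T / m                      # n x n relationship matrix
    _, evecs = np.linalg.eigh(grm)         # eigenvalues in ascending order
    y = evecs[:, ::-1][:, k]               # k-th largest eigenvector
    y = (y - y.mean()) / y.std()           # standardized phenotype

    # With both sides standardized, the regression slope is the correlation.
    r = np.clip(Z.T @ y / n, -0.999999, 0.999999)
    chi2 = (n - 2) * r**2 / (1 - r**2)     # squared t statistic, ~1 df
    # Upper tail of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2)).
    pval = np.vectorize(math.erfc)(np.sqrt(chi2 / 2))
    return {"eigenvector": y, "beta": r, "chi2": chi2, "pval": pval,
            "mean": mean, "sd": sd}


def project(new_genotypes, result):
    """Predict eigenvector scores in another sample from the SNP effects,
    reusing the discovery-sample means and standard deviations."""
    Z_new, _, _ = standardize(new_genotypes, result["mean"], result["sd"])
    return Z_new @ result["beta"] / Z_new.shape[1]
```

Given an individuals-by-SNPs matrix `G` of allele counts, `eigengwas(G)["chi2"]` gives one statistic per SNP, and large values flag loci that are strongly differentiated along the leading axis of ancestry. `project(G_new, result)` scores independent individuals from the estimated SNP effects without a new eigendecomposition, which is the sense in which the abstract describes predicting principal components in independent samples.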