Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York, United States of America.
PLoS One. 2010 Aug 18;5(8):e11892. doi: 10.1371/journal.pone.0011892.
Recent large-scale studies of European populations have demonstrated the existence of population genetic structure within Europe and the potential to accurately infer individual ancestry when information from hundreds of thousands of genetic markers is used. In fact, when genome-wide genetic variation of European populations is projected onto a two-dimensional Principal Components Analysis plot, a surprising correlation with the actual geographic coordinates of self-reported ancestry has been reported. This substructure can hamper the search for susceptibility genes for common complex disorders by producing spurious correlations. The identification of genetic markers that can correct for population stratification therefore becomes of paramount importance. Analyzing 1,200 individuals from 11 populations genotyped for more than 500,000 SNPs (Population Reference Sample), we present a systematic exploration of the extent to which geographic coordinates of origin within Europe can be predicted with small panels of SNPs. Markers are selected to correlate with the top principal components of the dataset, as we have previously demonstrated. Performing thorough cross-validation experiments, we show that it is indeed possible to predict individual ancestry within Europe to within a few hundred kilometers of actual individual origin, using information from carefully selected panels of 500 or 1,000 SNPs. Furthermore, we show that these panels can be used to correctly assign the HapMap Phase 3 European populations to their geographic origin. The SNPs that we propose can prove extremely useful in a variety of settings, such as stratification correction or genetic ancestry testing, as well as the study of the history of European populations.
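The abstract does not spell out the selection or prediction procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes genotypes are coded as allele counts (0, 1, 2), scores each SNP by its squared loadings on the top principal components, and maps panel genotypes to coordinates with ordinary least squares. The function names, the scoring rule, and the regression model are all illustrative assumptions.

```python
import numpy as np


def select_pc_correlated_snps(genotypes, n_components=2, panel_size=500):
    """Return indices of the SNPs that load most strongly on the top PCs.

    genotypes: (n_individuals, n_snps) array of allele counts (0, 1, 2).
    The scoring rule below is an assumption made for illustration.
    """
    # Center each SNP column so the SVD describes variation around the mean.
    centered = genotypes - genotypes.mean(axis=0)
    # Thin SVD: row i of vt holds the SNP loadings of principal component i.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Score each SNP by its summed squared loadings on the top components.
    scores = (vt[:n_components] ** 2).sum(axis=0)
    # Highest-scoring SNPs first, truncated to the requested panel size.
    return np.argsort(scores)[::-1][:panel_size]


def _design_matrix(genotypes, panel):
    """Intercept column followed by the genotypes of the panel SNPs."""
    return np.column_stack([np.ones(len(genotypes)), genotypes[:, panel]])


def fit_coordinate_model(genotypes, coords, panel):
    """Least-squares fit from panel genotypes to (latitude, longitude)."""
    beta, *_ = np.linalg.lstsq(_design_matrix(genotypes, panel), coords, rcond=None)
    return beta


def predict_coordinates(genotypes, panel, beta):
    """Predict coordinates for individuals using a fitted model."""
    return _design_matrix(genotypes, panel) @ beta
```

In a cross-validation setting such as the one the abstract describes, both the panel selection and the model fit would be run on the training individuals only, and `predict_coordinates` would then be applied to the held-out individuals.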