Ancestry informative markers for fine-scale individual assignment to worldwide populations.

Author information

Department of Molecular Biology and Genetics, Democritus University of Thrace, Panepistimioupoli, Dragana, Ktirio 8, Alexandroupoli 68100, Greece.

Publication information

J Med Genet. 2010 Dec;47(12):835-47. doi: 10.1136/jmg.2010.078212. Epub 2010 Oct 4.

DOI:10.1136/jmg.2010.078212

PMID:20921023

Abstract

BACKGROUND AND AIMS

The analysis of large-scale genetic data from thousands of individuals has revealed that subtle population genetic structure can be detected at levels that were previously unimaginable. Using the Human Genome Diversity Panel as reference (51 populations, 650,000 SNPs), this work describes a systematic evaluation of the resolution that can be achieved for the inference of genetic ancestry, even when small panels of genetic markers are used.

METHODS AND RESULTS

A comprehensive investigation of human population structure around the world is undertaken by leveraging the power of Principal Components Analysis (PCA). The problem is dissected into hierarchical steps and a decision tree for the prediction of individual ancestry is proposed. A complete leave-one-out validation experiment demonstrates that, using all available SNPs, assignment of individuals to their self-reported populations of origin is essentially perfect. Ancestry informative genetic markers are selected using two different metrics (In and correlation with PCA scores). A thorough cross-validation experiment indicates that, in most cases here, the number of SNPs needed for ancestry inference can be successfully reduced to less than 0.1% of the original 650,000 while retaining close to 100% accuracy. This reduction can be achieved using a novel clustering-based redundancy removal algorithm that is also introduced here. Finally, the applicability of our suggested SNP panels is tested on HapMap Phase 3 populations.
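As an illustration of the marker-selection step, the sketch below ranks biallelic SNPs by an informativeness score. It assumes that "In" refers to the informativeness-for-assignment statistic of Rosenberg et al., computed from per-population allele frequencies; the abstract does not spell out the formula, so this reading is an assumption. The function names (`informativeness`, `rank_snps`) and the toy frequency table are hypothetical and are not taken from the paper.

```python
import math


def _xlogx(x):
    """Return x * ln(x), using the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0.0 else 0.0


def informativeness(freqs):
    """Informativeness for assignment of one biallelic SNP (assumed formula).

    freqs: frequency of one allele in each of K populations.
    The score is 0 when all populations share the same frequency and
    ln(K) at most, reached only when alleles perfectly separate populations.
    """
    k = len(freqs)
    total = 0.0
    # Sum the contribution of both alleles of the SNP.
    for allele_freqs in (freqs, [1.0 - p for p in freqs]):
        mean = sum(allele_freqs) / k
        total += -_xlogx(mean) + sum(_xlogx(p) for p in allele_freqs) / k
    return total


def rank_snps(freq_table, top):
    """Return the names of the `top` highest-scoring SNPs."""
    scores = {snp: informativeness(f) for snp, f in freq_table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Toy data (hypothetical): allele frequencies in three populations.
toy = {
    "snp_a": [0.5, 0.5, 0.5],   # uninformative
    "snp_b": [0.9, 0.1, 0.5],   # moderately informative
    "snp_c": [1.0, 0.0, 0.0],   # fixed difference in one population
}
panel = rank_snps(toy, top=2)
```

On the toy table, `snp_c` scores highest and `snp_a` scores zero. A ranking like this does not account for correlated markers, which is the gap a redundancy-removal step would address.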

CONCLUSION

The proposed methods and ancestry informative marker panels, in combination with the increasingly more comprehensive databases of human genetic variation, open new horizons in a variety of fields, ranging from the study of human evolution and population history, to medical genetics and forensics.
