Allocco Dominic J, Song Qing, Gibbons Gary H, Ramoni Marco F, Kohane Isaac S
Children's Hospital Informatics Program at Harvard-MIT Division of Health Sciences and Technology, Boston, MA, USA.
BMC Genomics. 2007 Mar 10;8:68. doi: 10.1186/1471-2164-8-68.
Recent studies have shown that when individuals are grouped on the basis of genetic similarity, group membership corresponds closely to continental origin. There has been considerable debate about the implications of these findings in the context of larger debates about race and the extent of genetic variation between groups. Some have argued that clustering according to continental origin demonstrates the existence of significant genetic differences between groups and that these differences may have important implications for differences in health and disease. Others argue that clustering according to continental origin requires the use of large amounts of genetic data or specifically chosen markers and is indicative only of very subtle genetic differences that are unlikely to have biomedical significance.
We used small numbers of randomly selected single nucleotide polymorphisms (SNPs) from the International HapMap Project to train naïve Bayes classifiers for prediction of ancestral continent of origin. Predictive accuracy was tested on two independent data sets. Genetically similar groups should be difficult to distinguish, especially if only a small number of genetic markers are used. We found, however, that the genetic differences between continentally defined groups are sufficiently large that one can accurately predict ancestral continent of origin using only a minute, randomly selected fraction of the genetic variation present in the human genome. Genotype data from only 50 random SNPs were sufficient to predict ancestral continent of origin in our primary test data set with an average accuracy of 95%. Genetic variations informative about ancestry were common and widely distributed throughout the genome.
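The classification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the feature encoding, the smoothing, or the priors, so the code assumes each SNP is coded as a genotype count (0, 1 or 2 copies of one allele), uses Laplace smoothing with a pseudocount of 1, and takes priors from group sizes in the training set. The group names, allele frequencies, and sample sizes in the demonstration are simulated and hypothetical; they are not HapMap data and are not meant to reproduce the reported accuracy.

```python
import math
import random


def simulate_genotypes(allele_freqs, n, rng):
    """Simulate n individuals; each genotype is the number of copies (0, 1 or 2)
    of one allele, drawn independently per SNP (toy data, not HapMap)."""
    return [[(rng.random() < p) + (rng.random() < p) for p in allele_freqs]
            for _ in range(n)]


def train_naive_bayes(samples_by_group, pseudocount=1.0):
    """Estimate a log-prior per group and, per SNP, smoothed log-probabilities
    of genotypes 0, 1 and 2. Smoothing choice is an assumption of this sketch."""
    total = sum(len(rows) for rows in samples_by_group.values())
    model = {}
    for group, rows in samples_by_group.items():
        log_probs = []
        for j in range(len(rows[0])):
            counts = [pseudocount] * 3
            for row in rows:
                counts[row[j]] += 1
            denom = sum(counts)
            log_probs.append([math.log(c / denom) for c in counts])
        model[group] = (math.log(len(rows) / total), log_probs)
    return model


def predict(model, genotype_row):
    """Return the group with the highest posterior log-score, treating SNPs as
    conditionally independent given the group (the naive Bayes assumption)."""
    def score(group):
        log_prior, log_probs = model[group]
        return log_prior + sum(log_probs[j][g] for j, g in enumerate(genotype_row))
    return max(model, key=score)


def accuracy(model, labelled_rows):
    """Fraction of (label, genotype_row) pairs that are classified correctly."""
    hits = sum(predict(model, row) == label for label, row in labelled_rows)
    return hits / len(labelled_rows)


if __name__ == "__main__":
    rng = random.Random(0)
    n_snps = 50  # mirrors the 50-SNP setting; everything else is hypothetical
    # Hypothetical groups with independently drawn allele frequencies.
    freqs = {name: [rng.random() for _ in range(n_snps)]
             for name in ("group_1", "group_2", "group_3")}
    train = {g: simulate_genotypes(f, 60, rng) for g, f in freqs.items()}
    test = [(g, row) for g, f in freqs.items()
            for row in simulate_genotypes(f, 30, rng)]
    model = train_naive_bayes(train)
    print(f"held-out accuracy on simulated data: {accuracy(model, test):.2f}")
```

Because the simulated allele frequencies are drawn independently for each group, the toy groups are far more differentiated than real populations, so the printed accuracy says nothing about real data. The sketch only shows the mechanics: per-SNP genotype frequencies are estimated for each group, and a new individual is assigned to the group under which the observed genotypes are most probable.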
Accurate characterization of ancestry is possible using small numbers of randomly selected SNPs. The results presented here show how investigators conducting genetic association studies can use small numbers of arbitrarily chosen SNPs to identify stratification in study subjects and avoid false-positive genotype-phenotype associations. Our findings also demonstrate the extent of variation between continentally defined groups and argue strongly against the contention that genetic differences between groups are too small to have biomedical significance.