School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
Bioinformatics. 2011 Jul 1;27(13):i324-32. doi: 10.1093/bioinformatics/btr242.
Clustering of genotype data is an important way of understanding similarities and differences between populations. A summary of populations through clustering allows us to make inferences about the evolutionary history of the populations. Many methods have been proposed to perform clustering on multilocus genotype data. However, most of these methods do not directly address the question of how many clusters the data should be divided into and leave that choice to the user.
We present StructHDP, a method for automatically inferring the number of clusters from genotype data in the presence of admixture. Our method extends two existing methods, Structure and Structurama. Using a Hierarchical Dirichlet Process (HDP), we model the admixture of an unknown number of ancestral populations in a given sample of genotype data. We use a Gibbs sampler to perform inference on the resulting model and infer the ancestry proportions and the number of clusters that best explain the data.
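As a rough illustration of how an HDP prior can leave the number of ancestral populations open, the following Python sketch draws shared population weights by truncated stick-breaking and then draws each individual's ancestry proportions around those shared weights. It is a generic, finite-truncation approximation of an HDP prior, not the StructHDP model or its Gibbs sampler; the function names, the truncation level, the concentration values `gamma` and `alpha`, and the 1% usage threshold are all assumptions chosen for the example.

```python
import random


def stick_breaking(gamma, truncation, rng):
    """Truncated GEM(gamma) draw: shared weights over candidate populations."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        frac = rng.betavariate(1.0, gamma)  # fraction of the remaining stick
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)  # leftover mass goes to the last component
    return weights


def ancestry_proportions(beta, alpha, rng):
    """Draw one individual's proportions from Dirichlet(alpha * beta).

    Uses normalized gamma draws; the shape is clamped to stay positive
    (an implementation convenience for this sketch).
    """
    draws = [rng.gammavariate(max(alpha * b, 1e-12), 1.0) for b in beta]
    total = sum(draws)
    return [d / total for d in draws]


def populations_in_use(proportions, threshold=0.01):
    """Indices of populations holding more than `threshold` of any individual."""
    used = set()
    for pi in proportions:
        used.update(k for k, p in enumerate(pi) if p > threshold)
    return used


rng = random.Random(0)  # fixed seed so the sketch is reproducible
beta = stick_breaking(gamma=1.0, truncation=20, rng=rng)
individuals = [ancestry_proportions(beta, alpha=2.0, rng=rng) for _ in range(50)]
print("populations with non-negligible ancestry:", len(populations_in_use(individuals)))
```

Because every individual's proportions are centered on the same shared weights, individuals tend to reuse the same few populations, and the count of populations actually in use is determined by the draw rather than fixed in advance. Larger values of `gamma` spread the shared mass over more candidate populations.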
To demonstrate our method, we simulated data from an island model using the neutral coalescent. Comparing the results of StructHDP with those of Structurama shows the utility of combining HDPs with the Structure model. We used StructHDP to analyze a dataset of 155 Taita thrush (Turdus helleri) individuals that had previously been analyzed using Structure and Structurama. StructHDP correctly picks the optimal number of populations into which to cluster the data. The clustering based on the inferred ancestry proportions also agrees with that inferred using Structure for the optimal number of populations. We also analyzed data from 1048 individuals from 53 world populations in the Human Genome Diversity Project. We found that the clusters obtained correspond to major geographical divisions of the world, in agreement with previous analyses of the dataset.
StructHDP is written in C++. The code will be available for download at http://www.sailing.cs.cmu.edu/structhdp.