Dawson K J, Belkhir K
Centre for Mathematical and Computational Biology, Rothamsted Research, Harpenden, Hertfordshire, UK.
Heredity (Edinb). 2009 Jul;103(1):32-45. doi: 10.1038/hdy.2009.29. Epub 2009 Apr 1.
Clustering problems (including the clustering of individuals into outcrossing populations, hybrid generations, full-sib families and selfing lines) have recently received much attention in population genetics. In these clustering problems, the parameter of interest is a partition of the set of sampled individuals, called the sample partition. In a fully Bayesian approach to clustering problems of this type, our knowledge about the sample partition is represented by a probability distribution on the space of possible sample partitions. As the number of possible partitions grows very rapidly with the sample size, we cannot visualize this probability distribution in its entirety, unless the sample is very small. As a solution to this visualization problem, we recommend using an agglomerative hierarchical clustering algorithm, which we call the exact linkage algorithm. This algorithm is a special case of the maximin clustering algorithm that we introduced previously. The exact linkage algorithm is now implemented in our software package PartitionView. The exact linkage algorithm takes the posterior co-assignment probabilities as input and yields as output a rooted binary tree, or more generally, a forest of such trees. Each node of this forest defines a set of individuals, and the node height is the posterior co-assignment probability of this set. This provides a useful visual representation of the uncertainty associated with the assignment of individuals to categories. It is also a useful starting point for a more detailed exploration of the posterior distribution in terms of the co-assignment probabilities.
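To make the input and output described above concrete, the following is a minimal Python sketch of an agglomerative procedure of this kind. It is not the PartitionView implementation, and several details are assumptions that the abstract does not state: the posterior co-assignment probability of a set is estimated as the fraction of sampled partitions (for example, MCMC draws) in which all members of the set share a cluster; each step merges the pair of clusters whose union has the highest such probability; merging stops when no union has positive probability, which leaves a forest. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def coassignment_probability(members, sampled_partitions):
    """Fraction of sampled partitions in which all `members` share one cluster.

    Each sampled partition is a list of cluster labels, one per individual.
    Estimating the probability this way is an assumption of this sketch.
    """
    members = list(members)
    hits = 0
    for labels in sampled_partitions:
        first = labels[members[0]]
        if all(labels[i] == first for i in members):
            hits += 1
    return hits / len(sampled_partitions)


def exact_linkage_sketch(n_individuals, sampled_partitions):
    """Greedy agglomeration on co-assignment probabilities (assumed merge rule).

    Returns the internal nodes as (set of individuals, height) pairs in merge
    order, plus the sets that remain as roots of the resulting forest.
    """
    clusters = [frozenset([i]) for i in range(n_individuals)]
    nodes = []
    while len(clusters) > 1:
        best_pair, best_prob = None, 0.0
        for a, b in combinations(clusters, 2):
            p = coassignment_probability(a | b, sampled_partitions)
            if p > best_prob:  # ties keep the first pair encountered
                best_pair, best_prob = (a, b), p
        if best_pair is None:
            break  # no union has positive probability: the result is a forest
        a, b = best_pair
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        nodes.append((merged, best_prob))
    return nodes, clusters


# Hypothetical toy data: four sampled partitions of four individuals.
samples = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 2],
]
nodes, roots = exact_linkage_sketch(4, samples)
for members, height in nodes:
    print(sorted(members), height)
```

In this toy example, individuals 0 and 1 are co-assigned in every sampled partition, so their node has height 1.0. Individuals 2 and 3 are co-assigned in half of the samples, so their node has height 0.5. No sampled partition places all four individuals together, so the two trees are never joined and the output is a forest. Because the co-assignment probability of a set cannot exceed that of any of its subsets, node heights never increase toward the root.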