Bioinformatics Program at the University of Guelph, Canada.
Department of Mathematics and Statistics, University of Guelph, Canada.
J Theor Biol. 2019 Jul 7;472:36-45. doi: 10.1016/j.jtbi.2019.04.002. Epub 2019 Apr 4.
There have been longstanding concerns about the stability of hierarchical clustering. One suggested explanation for this instability is the presence of "rogue taxa", i.e., taxa whose removal from a data set can apparently restore stability. In this study, the rogue taxa hypothesis was tested by partitioning a large data set into many smaller ones and checking each for rogue behavior. The checks were performed with a standard hierarchical clustering algorithm and with a novel algorithm designed to have greater stability. It was found that rogue taxa cannot reasonably be said to exist, because whether a taxon is rogue depends on the data partition in which the taxon is embedded. In addition to the choice of data, the choice of algorithm and algorithm parameters can have a large effect on the degree to which a taxon appears rogue. Instability in hierarchical clustering can be increased by problematic data points, but whether a data point is problematic depends not on its biological antecedents but on its position in the local geometry of the data. The results strongly suggest that instability in traditional hierarchical clustering routines is primarily a problem of algorithm design.
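The kind of check described in the abstract (embed a taxon in many smaller data sets and see whether removing it changes the clustering) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' method: naive average-linkage clustering stands in for the "standard hierarchical clustering algorithm", random subsets stand in for the data partitions, taxa are represented as points in Euclidean space, one minus the Rand index serves as the disruption measure, and the function names are hypothetical.

```python
import math
import random
from itertools import combinations


def average_linkage(points, k):
    """Naive agglomerative clustering (average linkage), stopped at k clusters.

    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Mean pairwise Euclidean distance between the two clusters.
            d = sum(math.dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: combinations guarantees a < b
    return clusters


def labels_from(clusters, n):
    """Convert a list of clusters into a per-point label list of length n."""
    labels = [None] * n
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


def rand_index(l1, l2):
    """Fraction of point pairs on which two labelings agree (same/different)."""
    pairs = list(combinations(range(len(l1)), 2))
    agree = sum((l1[i] == l1[j]) == (l2[i] == l2[j]) for i, j in pairs)
    return agree / len(pairs)


def disruption(points, t, k):
    """How much removing point t changes the clustering of the other points.

    0.0 means the remaining points are grouped identically with or without t.
    """
    full = labels_from(average_linkage(points, k), len(points))
    rest = points[:t] + points[t + 1:]
    reduced = labels_from(average_linkage(rest, k), len(rest))
    full_without_t = full[:t] + full[t + 1:]
    return 1.0 - rand_index(full_without_t, reduced)


def disruption_across_subsets(data, taxon, n_subsets, subset_size, k, seed=0):
    """Score one taxon inside many random subsets of the data (assumed design)."""
    rng = random.Random(seed)
    others = [p for i, p in enumerate(data) if i != taxon]
    scores = []
    for _ in range(n_subsets):
        # Place the taxon of interest last so its index is known.
        subset = rng.sample(others, subset_size - 1) + [data[taxon]]
        scores.append(disruption(subset, len(subset) - 1, k))
    return scores
```

If the scores returned by `disruption_across_subsets` for a single taxon vary widely from subset to subset, that is consistent with the abstract's conclusion that apparent rogue status depends on the surrounding data rather than on the taxon itself. The `k` argument plays the role of an algorithm parameter whose choice can also change the outcome.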