Prodip Hore, Lawrence O. Hall, Dmitry B. Goldgof
Department of Computer Science and Engineering, ENB118, University of South Florida, Tampa, Florida 33620.
Pattern Recognit. 2009 May;42(5):676-688. doi: 10.1016/j.patcog.2008.09.027.
An ensemble of clustering solutions or partitions may be generated for a number of reasons. If the data set is very large, clustering may be done on disjoint subsets of tractable size. The data may be distributed across different sites, for which a distributed clustering solution with a final merging of partitions is a natural fit. In this paper, two new approaches to combining partitions, represented by sets of cluster centers, are introduced. The advantage of these approaches is that they provide a final partition of the data that is comparable to those of the best existing approaches, yet scale to extremely large data sets. They can be 100,000 times faster while using much less memory. The new algorithms are compared against the best existing cluster-ensemble merging approaches, against clustering all the data at once, and against a clustering algorithm designed for very large data sets. The comparison is done for both fuzzy and hard k-means based clustering algorithms. It is shown that the centroid-based ensemble merging algorithms presented here generate partitions of quality comparable to the best label-vector approach or to clustering all the data at once, while providing very large speedups.
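The core idea of centroid-based merging can be illustrated with a minimal sketch: run hard k-means independently on disjoint chunks of the data, pool the resulting cluster centers, and then cluster those centers to obtain the final k centroids. This is a simplified illustration under assumed details (1-D data, plain unweighted k-means, illustrative function names), not the authors' exact algorithms.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain hard k-means on 1-D points; returns k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from the data points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: abs(p - centers[i]))
            groups[j].append(p)
        # Update step: recompute means (keep old center if a group empties).
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers


def merge_by_centroids(data, k, n_chunks=4):
    """Cluster disjoint chunks separately, then cluster the pooled
    chunk centroids to get the final k centers. Only the n_chunks * k
    centroids are held for merging, not the full data set."""
    chunk = len(data) // n_chunks
    all_centers = []
    for c in range(n_chunks):
        subset = data[c * chunk:(c + 1) * chunk]
        all_centers.extend(kmeans(subset, k))
    # Merging step: k-means over the small set of pooled centroids.
    return sorted(kmeans(all_centers, k))
```

Because the merge operates only on the pooled centroids, its cost is independent of the number of data points, which is the source of the scalability the abstract describes.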