Topchy Alexander, Jain Anil K, Punch William
Nielsen Media Research, 501 Brooker Creek Blvd., Oldsmar, FL 34677, USA.
IEEE Trans Pattern Anal Mach Intell. 2005 Dec;27(12):1866-81. doi: 10.1109/TPAMI.2005.237.
Clustering ensembles have emerged as a powerful method for improving both the robustness and the stability of unsupervised classification solutions. However, finding a consensus clustering from multiple partitions is a difficult problem that can be approached from graph-based, combinatorial, or statistical perspectives. This study extends previous research on clustering ensembles in several respects. First, we introduce a unified representation for multiple clusterings and formulate the corresponding categorical clustering problem. Second, we propose a probabilistic model of consensus using a finite mixture of multinomial distributions in a space of clusterings. A combined partition is found as a solution to the corresponding maximum-likelihood problem using the EM algorithm. Third, we define a new consensus function that is related to the classical intraclass variance criterion using the generalized mutual information definition. Finally, we demonstrate the efficacy of combining partitions generated by weak clustering algorithms that use data projections and random data splits. A simple explanatory model is offered for the behavior of combinations of such weak clustering components. Combination accuracy is analyzed as a function of several parameters that control the power and resolution of component partitions as well as the number of partitions. We also analyze clustering ensembles with incomplete information and the effect of missing cluster labels on the quality of overall consensus. Experimental results demonstrate the effectiveness of the proposed methods on several real-world data sets.
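The mixture-model idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each object is described by the vector of cluster labels it received from the H component partitions, that these labels are conditionally independent given the consensus cluster, and that a missing label (written here as `None`) is simply skipped in the likelihood. The function name `em_consensus`, the smoothing constant, and the random soft initialization are hypothetical choices made for illustration.

```python
import math
import random


def em_consensus(labels, n_clusters, n_iter=50, seed=0):
    """EM for a finite mixture of multinomials over cluster-label vectors.

    labels: list of N rows; each row holds H integer labels (or None if missing).
    Returns a list of N consensus cluster indices.
    """
    rng = random.Random(seed)
    n, h = len(labels), len(labels[0])
    # Number of distinct labels used by each component partition.
    sizes = [max(r[j] for r in labels if r[j] is not None) + 1 for j in range(h)]

    # Random soft initialization of the responsibilities (an assumption).
    resp = []
    for _ in range(n):
        w = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(w)
        resp.append([x / total for x in w])

    for _ in range(n_iter):
        # M-step: mixing weights and per-partition label probabilities.
        alpha = [sum(resp[i][m] for i in range(n)) / n for m in range(n_clusters)]
        theta = [[[1e-6] * sizes[j] for j in range(h)] for _ in range(n_clusters)]
        for i, row in enumerate(labels):
            for j, y in enumerate(row):
                if y is None:
                    continue  # missing label contributes nothing
                for m in range(n_clusters):
                    theta[m][j][y] += resp[i][m]
        for m in range(n_clusters):
            for j in range(h):
                total = sum(theta[m][j])
                theta[m][j] = [v / total for v in theta[m][j]]

        # E-step: posterior of each consensus cluster, computed in log space.
        for i, row in enumerate(labels):
            logp = []
            for m in range(n_clusters):
                lp = math.log(alpha[m] + 1e-12)
                for j, y in enumerate(row):
                    if y is not None:
                        lp += math.log(theta[m][j][y])
                logp.append(lp)
            top = max(logp)
            w = [math.exp(v - top) for v in logp]
            total = sum(w)
            resp[i] = [x / total for x in w]

    # Hard consensus partition: most probable component for each object.
    return [max(range(n_clusters), key=lambda m: resp[i][m]) for i in range(n)]


# Three partitions of eight objects; label names differ across partitions
# and two labels are missing.
votes = [
    [0, 1, 0], [0, 1, 0], [0, 1, None], [0, 1, 0],
    [1, 0, 1], [1, 0, 1], [1, 0, 1], [None, 0, 1],
]
consensus = em_consensus(votes, n_clusters=2)
print(consensus)
```

On this toy input the first four objects should share one consensus label and the last four the other. Which group receives index 0 depends on the initialization, since mixture components are identifiable only up to a permutation of labels.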