School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, China.
IEEE Trans Pattern Anal Mach Intell. 2013 Jun;35(6):1509-22. doi: 10.1109/TPAMI.2012.228.
As a leading partitional clustering technique, k-modes is one of the most computationally efficient clustering methods for categorical data. In the k-modes, a cluster is represented by a "mode," which is composed of the attribute value that occurs most frequently in each attribute domain of the cluster, whereas, in real applications, using only one attribute value in each attribute to represent a cluster may not be adequate as it could in turn affect the accuracy of data analysis. To get rid of this deficiency, several modified clustering algorithms were developed by assigning appropriate weights to several attribute values in each attribute. Although these modified algorithms are quite effective, their convergence proofs are lacking. In this paper, we analyze their convergence property and prove that they cannot guarantee to converge under their optimization frameworks unless they degrade to the original k-modes type algorithms. Furthermore, we propose two different modified algorithms with weighted cluster prototypes to overcome the shortcomings of these existing algorithms. We rigorously derive updating formulas for the proposed algorithms and prove the convergence of the proposed algorithms. The experimental studies show that the proposed algorithms are effective and efficient for large categorical datasets.
作为一种主要的分区聚类技术,k-均值是用于分类数据的最有效计算方法之一。在 k-均值中,一个聚类由一个“模式”表示,该模式由聚类中每个属性域中出现最频繁的属性值组成,然而,在实际应用中,仅使用每个属性中的一个属性值来表示聚类可能不够充分,因为这反过来可能会影响数据分析的准确性。为了摆脱这一缺陷,开发了几种修改后的聚类算法,为每个属性中的几个属性值分配适当的权重。尽管这些修改后的算法非常有效,但它们的收敛证明却缺乏。在本文中,我们分析了它们的收敛特性,并证明除非它们退化到原始的 k-均值类型算法,否则它们在其优化框架下无法保证收敛。此外,我们提出了两种具有加权聚类原型的不同修改算法,以克服现有算法的缺点。我们严格推导了所提出算法的更新公式,并证明了所提出算法的收敛性。实验研究表明,所提出的算法对于大型分类数据集是有效和高效的。