The impact of cluster representatives on the convergence of the k-modes type clustering.

Affiliation

School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, China.

Publication information

IEEE Trans Pattern Anal Mach Intell. 2013 Jun;35(6):1509-22. doi: 10.1109/TPAMI.2012.228.

Abstract

As a leading partitional clustering technique, k-modes is one of the most computationally efficient clustering methods for categorical data. In k-modes, a cluster is represented by a "mode," which is composed of the attribute value that occurs most frequently in each attribute domain of the cluster. In real applications, however, using only one attribute value in each attribute to represent a cluster may not be adequate, as it could in turn affect the accuracy of data analysis. To overcome this deficiency, several modified clustering algorithms were developed that assign appropriate weights to several attribute values in each attribute. Although these modified algorithms are quite effective, their convergence proofs are lacking. In this paper, we analyze their convergence property and prove that they are not guaranteed to converge under their optimization frameworks unless they degrade to the original k-modes type algorithms. Furthermore, we propose two different modified algorithms with weighted cluster prototypes to overcome the shortcomings of these existing algorithms. We rigorously derive updating formulas for the proposed algorithms and prove their convergence. The experimental studies show that the proposed algorithms are effective and efficient for large categorical datasets.
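The notion of a "mode" described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`cluster_mode`, `simple_matching`, `value_frequencies`) and the toy data are hypothetical. The simple matching dissimilarity is assumed here as the distance commonly paired with k-modes; the abstract itself does not name it. The per-attribute frequency table only hints at what a representative with several weighted values per attribute might look like, and is not one of the two algorithms proposed in the paper.

```python
from collections import Counter


def cluster_mode(rows):
    """Return the mode of a cluster: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def simple_matching(x, mode):
    """Count the attributes on which object x and the mode disagree (assumed distance)."""
    return sum(a != b for a, b in zip(x, mode))


def value_frequencies(rows):
    """Relative frequency of every value per attribute (illustrative weighted representative)."""
    n = len(rows)
    return [{v: c / n for v, c in Counter(col).items()} for col in zip(*rows)]


# Toy categorical cluster: (color, size, shape). Hypothetical data.
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
    ("red", "small", "square"),
]

mode = cluster_mode(cluster)
distance = simple_matching(("blue", "large", "round"), mode)
weights = value_frequencies(cluster)
```

In this toy cluster the mode keeps only one value per attribute and discards the minority values ("blue", "large", "square"), which is the loss of information the abstract refers to; the frequency table retains them with their relative weights.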

Similar articles

The impact of cluster representatives on the convergence of the k-modes type clustering.

IEEE Trans Pattern Anal Mach Intell. 2013 Jun;35(6):1509-22. doi: 10.1109/TPAMI.2012.228.

On the impact of dissimilarity measure in k-modes clustering algorithm.

IEEE Trans Pattern Anal Mach Intell. 2007 Mar;29(3):503-7. doi: 10.1109/TPAMI.2007.53.

An Algorithm for Clustering Categorical Data With Set-Valued Features.

IEEE Trans Neural Netw Learn Syst. 2018 Oct;29(10):4593-4606. doi: 10.1109/TNNLS.2017.2770167. Epub 2017 Nov 29.

Efficient layered density-based clustering of categorical data.

J Biomed Inform. 2009 Apr;42(2):365-76. doi: 10.1016/j.jbi.2008.11.004. Epub 2008 Dec 10.

A novel artificial bee colony based clustering algorithm for categorical data.

PLoS One. 2015 May 20;10(5):e0127125. doi: 10.1371/journal.pone.0127125. eCollection 2015.

Clustering Categorical Data Using Community Detection Techniques.

Comput Intell Neurosci. 2017;2017:8986360. doi: 10.1155/2017/8986360. Epub 2017 Dec 21.

Coupled attribute similarity learning on categorical data.

IEEE Trans Neural Netw Learn Syst. 2015 Apr;26(4):781-97. doi: 10.1109/TNNLS.2014.2325872.

Rough set based information theoretic approach for clustering uncertain categorical data.

PLoS One. 2022 May 13;17(5):e0265190. doi: 10.1371/journal.pone.0265190. eCollection 2022.

General C-means clustering model.

IEEE Trans Pattern Anal Mach Intell. 2005 Aug;27(8):1197-211. doi: 10.1109/TPAMI.2005.160.

Space Structure and Clustering of Categorical Data.

IEEE Trans Neural Netw Learn Syst. 2016 Oct;27(10):2047-59. doi: 10.1109/TNNLS.2015.2451151. Epub 2015 Oct 2.

Cited by

Development and validation of risk stratification and shared decision-making tool for catheter ablation for atrial fibrillation in patients with heart failure: a multicentre cohort study.

EClinicalMedicine. 2025 Apr 28;83:103219. doi: 10.1016/j.eclinm.2025.103219. eCollection 2025 May.

Comprehensive analysis of clustering algorithms: exploring limitations and innovative solutions.

PeerJ Comput Sci. 2024 Aug 29;10:e2286. doi: 10.7717/peerj-cs.2286. eCollection 2024.

Manipulating measurement scales in medical statistical analysis and data mining: A review of methodologies.

J Res Med Sci. 2014 Jan;19(1):47-56.
