A Validity Index for Prototype-Based Clustering of Data Sets With Complex Cluster Structures.

Author information

Tasdemir K, Merenyi E

Publication information

IEEE Trans Syst Man Cybern B Cybern. 2011 Aug;41(4):1039-53. doi: 10.1109/TSMCB.2010.2104319. Epub 2011 Feb 4.

Abstract

Evaluating how well the extracted clusters fit the true partitions of a data set is one of the fundamental challenges in unsupervised clustering, because the data structure and the number of clusters are unknown a priori. Cluster validity indices are commonly used to select the best partitioning from different clustering results; however, they are often inadequate unless clusters are well separated or have parametric shapes. Prototype-based clustering (finding clusters by grouping the prototypes obtained by vector quantization of the data), which is becoming increasingly important for its effectiveness in the analysis of large high-dimensional data sets, adds another dimension to this challenge. For validity assessment of prototype-based clusterings, previously proposed indices (mostly devised for the evaluation of point-based clusterings) usually perform poorly. This poor performance worsens when the validity indices are applied to large data sets with complicated cluster structure. In this paper, we propose a new index, Conn_Index, which can be applied to data sets with a wide variety of clusters of different shapes, sizes, densities, or overlaps. We construct Conn_Index based on inter- and intra-cluster connectivities of prototypes. Connectivities are defined through a "connectivity matrix", which is a weighted Delaunay graph where the weights indicate the local data distribution. Experiments on synthetic and real data indicate that Conn_Index outperforms the existing validity indices considered in this paper for the evaluation of prototype-based clustering results.
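The abstract does not spell out how the connectivity matrix is built, so the following is only a minimal sketch under a stated assumption: the weight between two prototypes is taken to be the number of data points whose nearest and second-nearest prototypes are that pair. The function names `connectivity_matrix` and `intra_inter_weight` are hypothetical, and the intra/inter split is a simple illustration of the two kinds of connectivity, not the Conn_Index formula itself.

```python
from math import dist  # Euclidean distance, Python 3.8+


def connectivity_matrix(data, prototypes):
    """Assumed construction: conn[i][j] counts the data points whose two
    nearest prototypes are i and j (in either order). Needs >= 2 prototypes."""
    n = len(prototypes)
    conn = [[0] * n for _ in range(n)]
    for x in data:
        # Rank prototype indices by distance to the data point x.
        order = sorted(range(n), key=lambda k: dist(x, prototypes[k]))
        first, second = order[0], order[1]
        # Keep the matrix symmetric so it can be read as an undirected graph.
        conn[first][second] += 1
        conn[second][first] += 1
    return conn


def intra_inter_weight(conn, labels):
    """Illustrative split of the total edge weight into within-cluster and
    between-cluster parts, given one cluster label per prototype."""
    intra = inter = 0
    n = len(conn)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                intra += conn[i][j]
            else:
                inter += conn[i][j]
    return intra, inter
```

Under this assumption, a prototype grouping that keeps most of the weight inside clusters and little between them would be the kind of partition such an index is meant to favor.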
