Subspace Clustering of Categorical and Numerical Data With an Unknown Number of Clusters.

Authors

Jia Hong, Cheung Yiu-Ming

Publication information

IEEE Trans Neural Netw Learn Syst. 2018 Aug;29(8):3308-3325. doi: 10.1109/TNNLS.2017.2728138. Epub 2017 Aug 3.

Abstract

In clustering analysis, data attributes may contribute differently to the detection of different clusters. Subspace clustering addresses this problem by grouping data objects into clusters based on subsets of attributes rather than the entire data space. However, most existing subspace clustering methods apply to either numerical or categorical data, but not both. This paper therefore studies the soft subspace clustering of data with both numerical and categorical attributes (called mixed data for short). Specifically, an attribute-weighted clustering model based on the definition of object-cluster similarity is presented. Accordingly, a unified weighting scheme for the numerical and categorical attributes is proposed, which quantifies the attribute-to-cluster contribution by taking into account both intercluster difference and intracluster similarity. Moreover, a rival penalized competitive learning mechanism is introduced into the proposed soft subspace clustering algorithm so that the subspace cluster structure and the most appropriate number of clusters can be learned simultaneously in a single learning paradigm. In addition, an initialization-oriented method is presented, which can effectively improve the stability and accuracy of k-means-type clustering methods on numerical, categorical, and mixed data. Experimental results on different benchmark data sets show the efficacy of the proposed approach.
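
The rival penalized competitive learning (RPCL) mechanism mentioned in the abstract can be illustrated with a minimal sketch of the classic RPCL update on purely numerical data. This is not the paper's algorithm: it assumes squared Euclidean distance in place of the attribute-weighted object-cluster similarity, and the function names (`rpcl_step`, `rpcl_fit`) and learning rates are hypothetical choices made for illustration only.

```python
import numpy as np

def rpcl_step(x, centers, wins, lr_win=0.05, lr_rival=0.002):
    """Apply one classic RPCL update in place; return (winner, rival) indices.

    Assumption: squared Euclidean distance scaled by each unit's relative
    winning frequency, not the paper's object-cluster similarity.
    """
    freq = wins / wins.sum()                      # relative winning frequency
    dist = freq * np.sum((centers - x) ** 2, axis=1)
    order = np.argsort(dist)
    winner, rival = order[0], order[1]
    centers[winner] += lr_win * (x - centers[winner])   # winner moves toward x
    centers[rival] -= lr_rival * (x - centers[rival])   # rival is pushed away
    wins[winner] += 1
    return winner, rival

def rpcl_fit(data, n_units, epochs=20, seed=0):
    """Run RPCL over the data, starting from randomly chosen data points."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(data), n_units, replace=False)
    centers = data[start].astype(float)
    wins = np.ones(n_units)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            rpcl_step(data[i], centers, wins)
    return centers, wins
```

In this kind of scheme, `n_units` is typically set larger than the expected number of clusters, and units that are repeatedly penalized as rivals tend to drift away from the data, which is how the number of clusters can be learned alongside the cluster structure.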
