An Algorithm for Clustering Categorical Data With Set-Valued Features.

Author information

Huang Joshua Zhexue

Publication information

IEEE Trans Neural Netw Learn Syst. 2018 Oct;29(10):4593-4606. doi: 10.1109/TNNLS.2017.2770167. Epub 2017 Nov 29.

DOI:10.1109/TNNLS.2017.2770167

Abstract

In data mining, objects are often represented by a set of features, where each feature of an object has only one value. However, in reality, some features can take on multiple values, for instance, a person with several job titles, hobbies, and email addresses. These features can be referred to as set-valued features and are often treated with dummy features when using existing data mining algorithms to analyze data with set-valued features. In this paper, we propose an SV-$k$-modes algorithm that clusters categorical data with set-valued features. In this algorithm, a distance function is defined between two objects with set-valued features, and a set-valued mode representation of cluster centers is proposed. We develop a heuristic method to update cluster centers in the iterative clustering process and an initialization algorithm to select the initial cluster centers. The convergence and complexity of the SV-$k$-modes algorithm are analyzed. Experiments are conducted on both synthetic data and real data from five different applications. The experimental results have shown that the SV-$k$-modes algorithm performs better when clustering real data than do three other categorical clustering algorithms and that the algorithm is scalable to large data.
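
The abstract does not state the distance function itself, so the following is only a minimal sketch of how a distance between two objects with set-valued features could look. It assumes a per-feature Jaccard distance summed over all features; that choice, and the names `jaccard_distance`, `object_distance`, and `assign_to_nearest`, are illustrative assumptions rather than the paper's definitions. The sketch also leaves out the center-update heuristic and the initialization algorithm.

```python
from typing import FrozenSet, Sequence

# An object is a sequence of features; each feature holds a set of categories.
SetValuedObject = Sequence[FrozenSet[str]]


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def object_distance(x: SetValuedObject, y: SetValuedObject) -> float:
    """Sum the per-feature Jaccard distances (assumed form, not the paper's)."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of features")
    return sum(jaccard_distance(a, b) for a, b in zip(x, y))


def assign_to_nearest(obj: SetValuedObject,
                      centers: Sequence[SetValuedObject]) -> int:
    """Return the index of the closest center, as in a k-modes assignment step."""
    return min(range(len(centers)),
               key=lambda i: object_distance(obj, centers[i]))


# Hypothetical data: feature 0 is job titles, feature 1 is hobbies.
alice = (frozenset({"engineer", "lecturer"}), frozenset({"chess", "hiking"}))
bob = (frozenset({"engineer"}), frozenset({"chess", "hiking"}))
carol = (frozenset({"nurse"}), frozenset({"painting"}))

centers = [bob, carol]
nearest = assign_to_nearest(alice, centers)  # index of the closer center
```

Here the cluster centers are themselves tuples of sets, which mirrors the idea of a set-valued mode: a center keeps a set of categories per feature rather than a single category.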
