Huang Joshua Zhexue
IEEE Trans Neural Netw Learn Syst. 2018 Oct;29(10):4593-4606. doi: 10.1109/TNNLS.2017.2770167. Epub 2017 Nov 29.
In data mining, objects are often represented by a set of features, where each feature of an object has only one value. In practice, however, some features can take multiple values: a person may have several job titles, hobbies, and email addresses. Such features can be referred to as set-valued features, and existing data mining algorithms usually handle them by converting them to dummy features. In this paper, we propose an SV-$k$-modes algorithm that clusters categorical data with set-valued features. In this algorithm, a distance function is defined between two objects with set-valued features, and a set-valued mode representation of cluster centers is proposed. We develop a heuristic method to update cluster centers in the iterative clustering process and an initialization algorithm to select the initial cluster centers. The convergence and complexity of the SV-$k$-modes algorithm are analyzed. Experiments are conducted on synthetic data and on real data from five different applications. The experimental results show that the SV-$k$-modes algorithm performs better on real data than three other categorical clustering algorithms and that it scales to large data.
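The abstract names two building blocks, a distance between objects with set-valued features and a set-valued mode for cluster centers, without giving their definitions. The Python sketch below illustrates the idea only and is not the paper's formulation: it assumes a Jaccard distance per feature, summed over features, and a simple greedy heuristic for the mode. The function names `jaccard_distance`, `object_distance`, and `heuristic_set_mode` are hypothetical.

```python
from collections import Counter


def jaccard_distance(a, b):
    """Assumed per-feature distance: 1 - |a ∩ b| / |a ∪ b|.

    Two empty sets are treated as identical (distance 0).
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def object_distance(x, y):
    """Assumed object distance: sum of per-feature Jaccard distances.

    x and y are equal-length sequences of sets, one set per feature.
    """
    return sum(jaccard_distance(xa, ya) for xa, ya in zip(x, y))


def heuristic_set_mode(values):
    """Greedy set-valued 'mode' for one feature (illustrative heuristic).

    Items are added in order of decreasing frequency; the candidate set
    with the lowest total Jaccard distance to all values is kept.
    """
    values = [frozenset(v) for v in values]
    counts = Counter(item for v in values for item in v)

    def total_cost(center):
        return sum(jaccard_distance(center, v) for v in values)

    best = frozenset()
    best_cost = total_cost(best)
    candidate = set()
    for item, _ in counts.most_common():
        candidate.add(item)
        cost = total_cost(frozenset(candidate))
        if cost < best_cost:
            best, best_cost = frozenset(candidate), cost
    return best


# Usage: two people described by (job titles, hobbies).
alice = (frozenset({"engineer", "manager"}), frozenset({"chess"}))
bob = (frozenset({"engineer"}), frozenset({"running"}))
distance = object_distance(alice, bob)
center = heuristic_set_mode([alice[0], bob[0], frozenset({"engineer", "analyst"})])
```

Under these assumptions, the job-title sets {engineer, manager} and {engineer} are at distance 0.5, because they share one of two distinct titles. Encoding each title as a separate dummy feature would instead spread that comparison across several independent binary columns.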