Institute of Research and Development, Duy Tan University, P809 7/25 Quang Trung, Danang 550000, Vietnam.
Comput Intell Neurosci. 2017;2017:8986360. doi: 10.1155/2017/8986360. Epub 2017 Dec 21.
With the advent of the k-modes algorithm, the toolbox for clustering categorical data gained an efficient tool that scales linearly in the number of data items. However, random initialization of cluster centers in k-modes makes it hard to reach a good clustering without resorting to many trials. Recently proposed methods for better initialization are deterministic and reduce the clustering cost considerably. These initialization methods differ in how their heuristics choose the set of initial centers. In this paper, we address the clustering problem for categorical data from the perspective of community detection. Instead of initializing modes and running several iterations, our scheme, CD-Clustering, builds an unweighted graph and detects highly cohesive groups of nodes using a fast community detection technique. The top-k detected communities by size then define the modes. Evaluation on ten real categorical datasets shows that our method outperforms the existing initialization methods for k-modes in terms of accuracy, precision, and recall in most cases.
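The pipeline described in the abstract (build an unweighted graph, detect communities, take the largest k communities as modes) can be illustrated with a minimal sketch. The abstract does not say how edges are chosen or which community detection technique is used, so everything specific here is an assumption: two items are linked when they differ in at most `max_mismatches` attributes, and a simple deterministic label propagation stands in for the "fast community detection technique". All function names are hypothetical and this is not the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(rows, max_mismatches):
    """Unweighted graph: link two records if they differ in at most
    max_mismatches attributes (ASSUMED edge rule, not from the paper)."""
    adj = {i: set() for i in range(len(rows))}
    for i, j in combinations(range(len(rows)), 2):
        if hamming(rows[i], rows[j]) <= max_mismatches:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def label_propagation(adj, max_rounds=100):
    """Stand-in community detection (ASSUMED): each node repeatedly adopts
    the most frequent label among its neighbours, ties broken by the
    smallest label, until nothing changes or max_rounds is reached."""
    labels = {v: v for v in adj}
    for _ in range(max_rounds):
        changed = False
        for v in sorted(adj):
            if not adj[v]:
                continue  # isolated node keeps its own label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        if not changed:
            break
    groups = {}
    for v, l in labels.items():
        groups.setdefault(l, []).append(v)
    return list(groups.values())


def mode_of(rows, members):
    """Per-attribute most frequent value over the given records."""
    n_attrs = len(rows[0])
    return tuple(
        Counter(rows[i][c] for i in members).most_common(1)[0][0]
        for c in range(n_attrs)
    )


def cd_modes(rows, k, max_mismatches=1):
    """Modes of the k largest detected communities."""
    communities = label_propagation(build_graph(rows, max_mismatches))
    communities.sort(key=len, reverse=True)
    return [mode_of(rows, c) for c in communities[:k]]


# Toy data, invented for illustration only.
toy_rows = [
    ("red", "round", "small"),
    ("red", "round", "small"),
    ("red", "round", "large"),
    ("blue", "square", "large"),
    ("blue", "square", "large"),
    ("blue", "square", "small"),
    ("green", "oval", "medium"),
]
toy_modes = cd_modes(toy_rows, k=2)
```

The pairwise graph construction in this sketch is quadratic in the number of records, which is acceptable for a toy example but says nothing about how the paper builds its graph or what its running time is.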