A Novel Clustering Method Based on Adjacent Grids Searching.

Author information

Li Zhimeng, Zhong Wen, Liao Weiwen, Zhao Jian, Yu Ming, He Gaiyun

Affiliations

School of Control and Mechanical Engineering, Tianjin Chengjian University, Tianjin 300384, China.

School of Computer and Information Engineering, Tianjin Chengjian University, Tianjin 300384, China.

Publication information

Entropy (Basel). 2023 Sep 15;25(9):1342. doi: 10.3390/e25091342.

DOI:10.3390/e25091342

PMID:37761640

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10528124/

Abstract

Clustering is used to analyze the intrinsic structure of a dataset based on the similarity of datapoints. Its widespread use, from image segmentation to object recognition and information retrieval, requires great robustness in the clustering process. In this paper, a novel clustering method based on adjacent grid searching (CAGS) is proposed. The CAGS consists of two steps: a strategy based on adaptive grid-space construction and a clustering strategy based on adjacent grid searching. In the first step, a multidimensional grid space is constructed to provide a quantization structure of the input dataset. The noise and cluster halo are automatically distinguished according to grid density. Moreover, the adaptive grid generating process solves the common problem of grid clustering, in which the number of cells increases sharply with the dimension. In the second step, a two-stage traversal process is conducted to accomplish the cluster recognition. The cluster cores with arbitrary shapes can be found by concealing the halo points. As a result, the number of clusters will be easily identified by CAGS. Therefore, CAGS has the potential to be widely used for clustering datasets with different characteristics. We test the clustering performance of CAGS through six different types of datasets: dataset with noise, large-scale dataset, high-dimensional dataset, dataset with arbitrary shapes, dataset with large differences in density between classes, and dataset with high overlap between classes. Experimental results show that CAGS, which performed best on 10 out of 11 tests, outperforms the state-of-the-art clustering methods in all the above datasets.
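To make the two-step idea in the abstract concrete, the following is a minimal sketch of generic grid-based clustering: points are quantized into cells, sparse cells are set aside, and clusters are formed by searching adjacent dense cells. It is not the authors' CAGS implementation. The function name `grid_cluster`, the fixed `cell_size`, and the single `min_density` threshold are assumptions made for illustration; the adaptive grid construction, the noise/halo distinction, and the two-stage traversal described in the abstract are not reproduced here.

```python
from collections import defaultdict, deque
from itertools import product


def grid_cluster(points, cell_size, min_density):
    """Label points by flood-filling adjacent dense grid cells.

    Hypothetical helper for illustration only (not the CAGS algorithm).
    Returns one cluster id per point, with -1 for points in sparse cells.
    """
    # Quantize: map each point to the integer index of its grid cell.
    # Only occupied cells are stored, so memory grows with the data
    # rather than with the total number of cells in the grid.
    cells = defaultdict(list)
    for i, p in enumerate(points):
        key = tuple(int(x // cell_size) for x in p)
        cells[key].append(i)

    # Assumption: a cell is "dense" if it holds at least min_density points.
    dense = {k for k, members in cells.items() if len(members) >= min_density}

    dim = len(points[0]) if points else 0
    # Offsets to every adjacent cell (3**dim - 1 neighbours, self excluded).
    offsets = [o for o in product((-1, 0, 1), repeat=dim) if any(o)]

    labels = [-1] * len(points)
    seen = set()
    next_label = 0
    for start in sorted(dense):
        if start in seen:
            continue
        # Breadth-first search over adjacent dense cells forms one cluster.
        queue = deque([start])
        seen.add(start)
        while queue:
            cell = queue.popleft()
            for i in cells[cell]:
                labels[i] = next_label
            for off in offsets:
                nb = tuple(c + d for c, d in zip(cell, off))
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        next_label += 1
    return labels
```

In this sketch the number of clusters falls out of the search itself (it is `max(labels) + 1` when at least one dense cell exists), which mirrors the abstract's point that the cluster count does not have to be supplied in advance. Note that enumerating all `3**dim - 1` neighbour offsets becomes expensive as the dimension grows, so this naive version does not address the high-dimensional case that the paper targets.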
