Cheng Dongdong, Li Ya, Xia Shuyin, Wang Guoyin, Huang Jinlong, Zhang Sulan
IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):17202-17215. doi: 10.1109/TNNLS.2023.3300916. Epub 2024 Dec 2.
Density peaks clustering algorithm (DP) has difficulty in clustering large-scale data, because it requires the distance matrix to compute the density and -distance for each object, which has time complexity. Granular ball (GB) is a coarse-grained representation of data. It is based on the fact that an object and its local neighbors have similar distribution and they have high possibility of belonging to the same class. It has been introduced into supervised learning by Xia et al. to improve the efficiency of supervised learning, such as support vector machine, -nearest neighbor classification, rough set, etc. Inspired by the idea of GB, we introduce it into unsupervised learning for the first time and propose a GB-based DP algorithm, called GB-DP. First, it generates GBs from the original data with an unsupervised partitioning method. Then, it defines the density of GBs, instead of the density of objects, according to the centers, radius, and distances between its members and centers, without setting any parameters. After that, it computes the distance between the centers of GBs as the distance between GBs and defines the -distance of GBs. Finally, it uses GBs' density and -distance to plot the decision graph, employs DP algorithm to cluster them, and expands the clustering result to the original data. Since there is no need to calculate the distance between any two objects and the number of GBs is far less than the scale of a data, it greatly reduces the running time of DP algorithm. By comparing with -means, ball -means, DP, DPC-KNN-PCA, FastDPeak, and DLORE-DP, GB-DP can get similar or even better clustering results in much less running time without setting any parameters. The source code is available at https://github.com/DongdongCheng/GB-DP.
密度峰值聚类算法(DP)在对大规模数据进行聚类时存在困难,因为它需要距离矩阵来计算每个对象的密度和距离,这具有时间复杂度。粒度球(GB)是数据的一种粗粒度表示。它基于这样一个事实,即一个对象及其局部邻域具有相似的分布,并且它们很有可能属于同一类。Xia等人已将其引入监督学习中,以提高监督学习的效率,如支持向量机、最近邻分类、粗糙集等。受GB思想的启发,我们首次将其引入无监督学习,并提出了一种基于GB的DP算法,称为GB-DP。首先,它使用无监督划分方法从原始数据中生成GB。然后,它根据GB的中心、半径及其成员与中心之间的距离来定义GB的密度,而不是对象的密度,无需设置任何参数。之后,它计算GB中心之间的距离作为GB之间的距离,并定义GB的距离。最后,它使用GB的密度和距离来绘制决策图,采用DP算法对它们进行聚类,并将聚类结果扩展到原始数据。由于无需计算任意两个对象之间的距离,且GB的数量远小于数据规模,因此大大减少了DP算法的运行时间。通过与均值法、球均值法、DP、DPC-KNN-PCA、FastDPeak和DLORE-DP进行比较,GB-DP在不设置任何参数的情况下,能够在更短的运行时间内获得相似甚至更好的聚类结果。源代码可在https://github.com/DongdongCheng/GB-DP获取。