Yip Andy M, Ding Chris, Chan Tony F
Department of Mathematics, National University of Singapore, 2, Science Drive 2, Singapore 117543, Singapore.
IEEE Trans Pattern Anal Mach Intell. 2006 Jun;28(6):877-89. doi: 10.1109/TPAMI.2006.117.
Density-based clustering has two advantages: 1) it allows clusters of arbitrary shape and 2) it does not require the number of clusters as input. However, when clusters touch each other, both the cluster centers and the cluster boundaries (the peaks and valleys of the density distribution) become fuzzy and difficult to determine. We introduce the notion of a cluster intensity function (CIF), which captures the important characteristics of clusters. When clusters are well separated, CIFs are similar to density functions. But when clusters become close to each other, CIFs still clearly reveal the cluster centers, the cluster boundaries, and the degree of membership of each data point in the cluster to which it belongs. Clustering through bump hunting and valley seeking based on these functions is more robust than clustering based on density functions obtained by kernel density estimation, which are often oscillatory or oversmoothed. These problems of kernel density estimation are resolved using Level Set Methods and related techniques. Comparisons with two existing density-based methods, valley seeking and DBSCAN, are presented and illustrate the advantages of our approach.
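The remark that kernel density estimates are often oscillatory or oversmoothed can be illustrated with a minimal sketch. It is not the CIF or Level Set construction from the paper, only a plain one-dimensional Gaussian kernel density estimate. The sample values, the grid, the three bandwidths, and the helper names `gaussian_kde_1d` and `count_peaks` are all assumptions chosen for illustration.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on a 1D grid."""
    # Scaled distance from every grid point to every sample.
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # Average the per-sample kernels to obtain the density estimate.
    return kernels.mean(axis=1)


def count_peaks(values):
    """Count strict interior local maxima of a sampled curve."""
    interior = values[1:-1]
    is_peak = (interior > values[:-2]) & (interior > values[2:])
    return int(is_peak.sum())


# Hypothetical data: two small groups of points centered near -1 and +1.
samples = np.array([-1.2, -1.0, -0.8, 0.8, 1.0, 1.2])
grid = np.linspace(-3.0, 3.0, 601)

# Illustrative bandwidths: very narrow, moderate, very wide.
peaks_by_bandwidth = {
    h: count_peaks(gaussian_kde_1d(samples, grid, h)) for h in (0.02, 0.3, 2.0)
}

for h, n_peaks in peaks_by_bandwidth.items():
    print(f"bandwidth={h}: {n_peaks} density peak(s)")
```

With these inputs, the narrowest bandwidth produces one peak per sample (an oscillatory estimate), while the widest merges both groups into a single peak (an oversmoothed estimate). Bump hunting on either estimate would report the wrong number of clusters, which is the sensitivity the abstract attributes to kernel density estimation.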