Yellamraju Tarun, Boutin Mireille
IEEE Trans Image Process. 2018 Apr;27(4):1927-1938. doi: 10.1109/TIP.2017.2789327.
Clustering a high-dimensional dataset is known to be very difficult. In this paper, we show that this is not the case when the points to cluster correspond to images. More specifically, image datasets are shown to have a great deal of structure, so much so that projecting the set onto a random 1D linear subspace is likely to uncover a binary grouping among the images. Based on this observation, we propose a method to quantify the clusterability of a dataset. The method is based on the probability density of a measure (S) of the 1D clusterability of the projection of the data onto a random line. After comparing the clusterability of image datasets with that of synthetically generated clusters, we conclude that the intriguing structures we find in image datasets do not fit the notion of clusters in the traditional sense. Our observation also suggests a fast method for clustering high-dimensional data in a hierarchical fashion: at each stage, the data is partitioned into two groups based on the binary clustering found in a random 1D projection of the data. Since most of the computations are performed in 1D, this approach is extremely efficient. Despite its simplicity, it achieves better overall clustering quality than existing high-dimensional clustering methods, not only for datasets representing image data but for other real datasets as well. Our results highlight the need to re-examine our assumptions about high-dimensional clustering and the geometry of real datasets such as sets of images.
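The procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the measure S, so the score used here (within-group sum of squares of the best two-group split of the 1D projection, divided by the total sum of squares) is an assumed stand-in rather than the paper's definition. The function names, the number of random lines tried, and the `cutoff`, `min_size`, and `max_depth` stopping parameters are likewise hypothetical choices made for illustration.

```python
import numpy as np

def best_split_1d(x):
    """Best two-group split of 1D values.

    Returns (score, threshold), where score is the within-group sum of
    squares divided by the total sum of squares (assumed stand-in for S;
    lower means more clusterable). threshold is None for constant input.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    total = np.sum((x - x.mean()) ** 2) if n else 0.0
    if n < 2 or total == 0.0:
        return 1.0, None
    csum, csq = np.cumsum(x), np.cumsum(x ** 2)
    k = np.arange(1, n)  # candidate sizes of the left group
    left = csq[k - 1] - csum[k - 1] ** 2 / k
    right = (csq[-1] - csq[k - 1]) - (csum[-1] - csum[k - 1]) ** 2 / (n - k)
    within = left + right
    i = int(np.argmin(within))
    return max(within[i], 0.0) / total, 0.5 * (x[i] + x[i + 1])

def random_projection_scores(X, n_lines=200, seed=0):
    """Scores of projections onto random lines; a histogram of these
    values approximates the density of the score for the dataset."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    scores = np.empty(n_lines)
    for j in range(n_lines):
        w = rng.standard_normal(X.shape[1])
        scores[j], _ = best_split_1d(X @ w / np.linalg.norm(w))
    return scores

def split_once(X, n_lines=50, rng=None):
    """Try several random lines and keep the split with the lowest score
    (trying more than one line is an assumption of this sketch)."""
    rng = np.random.default_rng(rng)
    best_score, best_mask = 1.0, np.zeros(len(X), dtype=bool)
    for _ in range(n_lines):
        w = rng.standard_normal(X.shape[1])
        proj = X @ w / np.linalg.norm(w)
        score, thr = best_split_1d(proj)
        if thr is not None and score < best_score:
            best_score, best_mask = score, proj > thr
    return best_score, best_mask

def hierarchical_split(X, max_depth=3, min_size=20, cutoff=0.3,
                       n_lines=50, seed=0):
    """Recursively bisect the data using 1D random projections.
    The stopping rules (cutoff, min_size, max_depth) are hypothetical."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    next_label = 1
    stack = [(np.arange(len(X)), 0)]
    while stack:
        idx, depth = stack.pop()
        if depth >= max_depth or idx.size < min_size:
            continue
        score, mask = split_once(X[idx], n_lines, rng)
        if score > cutoff:
            continue  # no convincing binary grouping found in this subset
        labels[idx[mask]] = next_label
        next_label += 1
        stack.append((idx[~mask], depth + 1))
        stack.append((idx[mask], depth + 1))
    return labels
```

Calling `random_projection_scores(X)` on a data matrix `X` with one row per point gives a sample of scores whose histogram can be compared across datasets, and `hierarchical_split(X)` returns one integer label per row. Apart from the matrix-vector products used to project the data, the work is a sort and cumulative sums on 1D arrays, which is why this style of method is cheap.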