Clusterability and Clustering of Images and Other "Real" High-Dimensional Data.

Author information

Yellamraju Tarun, Boutin Mireille

Publication information

IEEE Trans Image Process. 2018 Apr;27(4):1927-1938. doi: 10.1109/TIP.2017.2789327.

DOI:10.1109/TIP.2017.2789327

Abstract

Clustering a high-dimensional data set is known to be very difficult. In this paper, we show that this is not the case when the points to cluster correspond to images. More specifically, image data sets are shown to have so much structure that projecting the set onto a random 1D linear subspace is likely to uncover a binary grouping among the images. Based on this observation, we propose a method to quantify the clusterability of a data set. The method is based on the probability density of a measure (S) of the 1D clusterability of the projection of the data onto a random line. After comparing the clusterability of image data sets with that of synthetically generated clusters, we conclude that the intriguing structures we find in image data sets do not fit the notion of clusters in the traditional sense. Our observation also suggests a fast method for clustering high-dimensional data hierarchically: at each stage, the data is partitioned in two based on the binary clustering found in a 1D random projection of the data. Since most of the computations are performed in 1D, this approach is extremely efficient. Despite its simplicity, it achieves better overall clustering quality than existing high-dimensional clustering methods, not only for data sets representing image data but for other real data sets as well. Our results highlight the need to re-examine our assumptions about high-dimensional clustering and the geometry of real data sets such as sets of images.
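
The two ideas in the abstract, scoring random 1D projections and recursively bisecting the data, can be illustrated with a minimal Python sketch. The abstract does not define the measure S, so the score used here (the fraction of 1D scatter removed by the best two-way threshold split) is only an assumed stand-in. Keeping the best of several random directions at each stage is likewise an assumption, and the function names `best_split_1d`, `projection_scores`, and `bisecting_labels` are hypothetical.

```python
import numpy as np


def best_split_1d(values):
    """Best two-way threshold split of 1D values.

    Returns (score, threshold). score = 1 - W/T is an ASSUMED stand-in for
    the abstract's measure S: T is the total scatter and W the within-group
    scatter after the split. It lies in [0, 1]; higher means a cleaner split.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n < 2:
        return 0.0, float(x[0]) if n else 0.0
    xc = x - x.mean()                      # center for numerical stability
    total = float(np.sum(xc ** 2))
    if total == 0.0:                       # all values identical
        return 0.0, float(x[0])
    # Prefix sums give the scatter of every left/right split in O(n).
    csum, csum2 = np.cumsum(xc), np.cumsum(xc ** 2)
    k = np.arange(1, n)                    # size of the left group
    left = csum2[:-1] - csum[:-1] ** 2 / k
    right = (csum2[-1] - csum2[:-1]) - (csum[-1] - csum[:-1]) ** 2 / (n - k)
    within = left + right
    i = int(np.argmin(within))
    score = float(np.clip(1.0 - within[i] / total, 0.0, 1.0))
    return score, float(0.5 * (x[i] + x[i + 1]))


def random_direction(dim, rng):
    """Unit vector drawn uniformly at random from the sphere."""
    d = rng.standard_normal(dim)
    return d / np.linalg.norm(d)


def projection_scores(data, n_projections=200, seed=None):
    """Scores of many random 1D projections; their histogram approximates
    the density of the (assumed) clusterability score."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    return np.array([
        best_split_1d(data @ random_direction(data.shape[1], rng))[0]
        for _ in range(n_projections)
    ])


def bisecting_labels(data, depth=2, n_projections=50, seed=None):
    """Hierarchical clustering sketch: at each level, split every group in
    two along the best-scoring of several random 1D projections (ASSUMED
    selection rule). Returns one integer label per row."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    labels = np.zeros(len(data), dtype=int)
    for _level in range(depth):
        new_labels = labels * 2
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            if idx.size < 2:
                continue
            best_score, best_mask = -1.0, None
            for _trial in range(n_projections):
                proj = data[idx] @ random_direction(data.shape[1], rng)
                score, threshold = best_split_1d(proj)
                if score > best_score:
                    best_score, best_mask = score, proj > threshold
            new_labels[idx[best_mask]] += 1
        labels = new_labels
    return labels
```

Each split works only on sorted 1D projections, so the work per projection is dominated by one matrix-vector product and one sort. A histogram of `projection_scores(X)` gives a rough picture of how often a random line exposes a two-group structure under this assumed score.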
