ECCT: Efficient Contrastive Clustering via Pseudo-Siamese Vision Transformer and Multi-view Augmentation.

Affiliations

School of Computer and Information, Hefei University of Technology, Hefei, 230601, China; Intelligent Manufacturing Technology Research Institute, Hefei University of Technology, Hefei, 230601, China.

School of Computer and Information, Hefei University of Technology, Hefei, 230601, China.

Publication

Neural Netw. 2024 Dec;180:106684. doi: 10.1016/j.neunet.2024.106684. Epub 2024 Sep 2.

Abstract

Image clustering aims to divide a set of unlabeled images into multiple clusters. Recently, clustering methods based on contrastive learning have attracted much attention because they learn discriminative feature representations. Nevertheless, existing clustering algorithms struggle to capture global information and to preserve semantic continuity. In addition, these methods often produce relatively narrow feature distributions, which limits the full potential of contrastive learning in clustering. These problems can degrade image clustering performance. To address them, we propose a deep clustering framework termed Efficient Contrastive Clustering via Pseudo-Siamese Vision Transformer and Multi-view Augmentation (ECCT). The core idea is to introduce a Vision Transformer (ViT) to provide the global view, and to improve it with a Hilbert Patch Embedding (HPE) module that constructs a new ViT branch. We then fuse the features extracted from the two ViT branches to obtain both a global view and semantic coherence. In addition, we employ multi-view random aggressive augmentation to broaden the feature distribution, enabling the model to learn more comprehensive and richer contrastive features. Results on five datasets demonstrate that ECCT outperforms previous clustering methods. In particular, the ARI of ECCT on the STL-10 (ImageNet-Dogs) dataset is 0.852 (0.424), which is 10.3% (4.8%) higher than the best baseline.
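The HPE module's role of preserving semantic continuity rests on a property of the Hilbert space-filling curve: patches that are consecutive along the curve are also spatially adjacent in the image, unlike raster-order patch embedding. The following is a minimal sketch (not taken from the ECCT code; all names are illustrative) of how image-patch coordinates can be enumerated in Hilbert order, using the standard iterative index-to-coordinate mapping.

```python
# Sketch: ordering image patches along a Hilbert curve instead of raster
# order, so that patches adjacent in the token sequence are also adjacent
# in the image -- the locality property an HPE-style module relies on.

def d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) in an n x n grid (n a power of 2)."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate the quadrant so sub-curves connect end to end
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_patch_order(grid):
    """Return patch (row, col) coordinates in Hilbert order for a grid x grid layout."""
    return [d2xy(grid, d) for d in range(grid * grid)]

# e.g. a 4x4 grid of image patches, flattened in Hilbert rather than raster order
order = hilbert_patch_order(4)
```

Every consecutive pair of patches in `order` differs by exactly one grid step, whereas raster order jumps across the full image width at the end of each row; feeding tokens to a ViT branch in this order keeps neighboring tokens semantically close.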

