TCFormer: Visual Recognition via Token Clustering Transformer.

Author Information

Zeng Wang, Jin Sheng, Xu Lumin, Liu Wentao, Qian Chen, Ouyang Wanli, Luo Ping, Wang Xiaogang

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):9521-9535. doi: 10.1109/TPAMI.2024.3425768. Epub 2024 Nov 6.

Abstract

Transformers are widely used in computer vision and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, a fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens. Through extensive experimentation across various applications, including image classification, human pose estimation, semantic segmentation, and object detection, we demonstrate the effectiveness of our TCFormer.
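The abstract describes the clustering mechanism only at a high level. The sketch below illustrates the general idea of merging regular grid tokens into dynamic tokens by clustering their features, so that non-adjacent patches with similar semantics can share a single token. The clustering algorithm (plain k-means here), the number of dynamic tokens, and the helper name `cluster_tokens` are illustrative assumptions for this sketch, not details taken from the paper, whose merging procedure may differ.

```python
# Minimal sketch (not the authors' implementation): merge regular grid
# tokens into "dynamic" tokens by clustering their features, so that
# patches with similar semantics share one token regardless of position.
import torch

def cluster_tokens(tokens: torch.Tensor, k: int, iters: int = 10):
    """tokens: (N, C) grid-token features -> (k, C) dynamic tokens plus assignments."""
    n = tokens.shape[0]
    # Initialise cluster centres from k randomly chosen grid tokens.
    centres = tokens[torch.randperm(n)[:k]].clone()
    for _ in range(iters):
        # Assign each grid token to its nearest centre in feature space.
        dists = torch.cdist(tokens, centres)   # (N, k) pairwise L2 distances
        assign = dists.argmin(dim=1)           # (N,) cluster index per token
        # Recompute each centre as the mean of its assigned tokens.
        for c in range(k):
            members = tokens[assign == c]
            if members.shape[0] > 0:
                centres[c] = members.mean(dim=0)
    return centres, assign

# Usage: a 56x56 grid of 64-dim patch features merged into 256 dynamic tokens.
grid_tokens = torch.randn(56 * 56, 64)
dynamic_tokens, assignment = cluster_tokens(grid_tokens, k=256)
print(dynamic_tokens.shape)  # torch.Size([256, 64])
```

In a full pipeline, the merged tokens would feed subsequent transformer stages in place of the fixed grid, so model capacity is allocated according to semantic content rather than spatial position.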

