Zeng Wang, Jin Sheng, Xu Lumin, Liu Wentao, Qian Chen, Ouyang Wanli, Luo Ping, Wang Xiaogang
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):9521-9535. doi: 10.1109/TPAMI.2024.3425768. Epub 2024 Nov 6.
Transformers are widely used in computer vision and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, a fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings with the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them with fine tokens. Through extensive experiments across various applications, including image classification, human pose estimation, semantic segmentation, and object detection, we demonstrate the effectiveness of TCFormer.
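The sketch below illustrates the core idea of dynamic tokens described in the abstract: grouping token features by semantic similarity rather than spatial position, then representing each group with a single merged token. It is a minimal illustration only, using plain k-means as a stand-in for the paper's clustering module; the function name `merge_tokens` and all parameters are hypothetical, not the authors' implementation.

```python
import torch

def merge_tokens(tokens: torch.Tensor, num_clusters: int, iters: int = 10) -> torch.Tensor:
    """Illustrative dynamic-token merging (hypothetical sketch, not TCFormer's
    actual module): cluster token features and average each cluster.

    tokens: (N, C) token features for one image.
    Returns: (num_clusters, C) merged tokens.
    """
    # Initialize cluster centers from a random subset of tokens.
    idx = torch.randperm(tokens.size(0))[:num_clusters]
    centers = tokens[idx].clone()
    for _ in range(iters):
        # Assign every token to its nearest center by feature distance,
        # ignoring spatial position, so non-adjacent regions with similar
        # semantics can end up sharing one token.
        dists = torch.cdist(tokens, centers)  # (N, K)
        assign = dists.argmin(dim=1)          # (N,)
        # Update each center as the mean feature of its assigned tokens.
        for k in range(num_clusters):
            members = tokens[assign == k]
            if members.numel() > 0:
                centers[k] = members.mean(dim=0)
    return centers

# Example: 196 grid tokens (a 14x14 patch grid) with 64-dim features,
# merged into 49 dynamic tokens.
tokens = torch.randn(196, 64)
merged = merge_tokens(tokens, num_clusters=49)
print(merged.shape)  # torch.Size([49, 64])
```

Under this scheme, the number of tokens spent on a region tracks its feature diversity rather than its area, which is how detail-rich regions can retain fine tokens while uniform backgrounds collapse into coarse ones.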