School of Computer Science, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence and Wuhan Institute of Data Intelligence, Wuhan University, Wuhan, 430072, China.
School of Electronic Information, Wuhan University, Wuhan, 430072, China.
Neural Netw. 2024 Jun;174:106235. doi: 10.1016/j.neunet.2024.106235. Epub 2024 Mar 14.
Recently, the Vision Transformer (ViT) has achieved promising performance in image recognition and has gradually become a powerful backbone for various vision tasks. To satisfy the sequential input of the Transformer, the tail of ViT first splits each image into a sequence of visual tokens with a fixed length. The following self-attention layers then construct global relationships among the tokens to produce useful representations for downstream tasks. Empirically, representing an image with more tokens leads to better performance, yet the computational complexity of the self-attention layer, which is quadratic in the number of tokens, can seriously degrade the efficiency of ViT's inference. To reduce computation, a few pruning methods progressively prune uninformative tokens inside the Transformer encoder, while leaving the number of tokens fed into the encoder untouched. In fact, feeding fewer tokens into the Transformer encoder directly reduces the subsequent computational cost. In this spirit, we propose a Multi-Tailed Vision Transformer (MT-ViT) in this paper. MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder, and a tail predictor is introduced to decide which tail is the most efficient for each image while still producing an accurate prediction. Both modules are optimized in an end-to-end fashion with the Gumbel-Softmax trick. Experiments on ImageNet-1K demonstrate that MT-ViT achieves a significant reduction in FLOPs with no degradation in accuracy, and outperforms the compared methods in both accuracy and FLOPs.
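The sketch below illustrates the idea described in the abstract: several "tails" (patch-embedding layers with different patch sizes) produce token sequences of different lengths, and a lightweight predictor selects one tail per image via the Gumbel-Softmax trick so the discrete choice stays differentiable. This is a minimal PyTorch-style sketch under assumed module names (MultiTailEmbedding, the pooled-tail combination); it is not the authors' released implementation, and at inference only the selected tail would be executed, which is where the FLOPs saving comes from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTailEmbedding(nn.Module):
    """Several tails, each splitting the image into a different number of tokens."""
    def __init__(self, patch_sizes=(16, 32), embed_dim=384, in_chans=3):
        super().__init__()
        # One patch-embedding "tail" per patch size: larger patches -> fewer tokens.
        self.tails = nn.ModuleList(
            nn.Conv2d(in_chans, embed_dim, kernel_size=p, stride=p) for p in patch_sizes
        )
        # A tiny tail predictor: global pooling followed by a linear head over tails.
        self.predictor = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),
            nn.Flatten(),
            nn.Linear(in_chans * 8 * 8, len(patch_sizes)),
        )

    def forward(self, x, tau=1.0):
        logits = self.predictor(x)                              # (B, num_tails)
        # Gumbel-Softmax gives a (nearly) one-hot, differentiable tail selection.
        choice = F.gumbel_softmax(logits, tau=tau, hard=True)   # (B, num_tails)
        # For illustration we run every tail and combine pooled summaries weighted
        # by the selection; the actual model would feed the chosen tail's token
        # sequence to the shared Transformer encoder.
        outputs = []
        for i, tail in enumerate(self.tails):
            tokens = tail(x).flatten(2).transpose(1, 2)         # (B, N_i, embed_dim)
            outputs.append(choice[:, i:i + 1] * tokens.mean(dim=1))
        return torch.stack(outputs, dim=0).sum(dim=0)           # (B, embed_dim)

# Example usage with a dummy batch of 224x224 images.
model = MultiTailEmbedding()
out = model(torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 384])
```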