Wu Yu-Huan, Liu Yun, Zhan Xin, Cheng Ming-Ming
IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12760-12771. doi: 10.1109/TPAMI.2022.3202765. Epub 2023 Oct 3.
Recently, the vision transformer has achieved great success by pushing the state-of-the-art of various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, where the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not yet been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Equipped with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when applied as the backbone network, P2T shows substantial superiority in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T.
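To make the core idea concrete, the following is a minimal NumPy sketch of attention with pyramid-pooled keys and values: queries keep the full token sequence, while keys/values come from concatenating several average-pooled versions of the feature map, so attention cost drops from O(N²) to O(N·M) with M ≪ N. This is an illustration only, not the authors' implementation: the pool sizes (1, 2, 3, 6) are borrowed from classic pyramid pooling and are assumptions, and learned projections, multiple heads, depth-wise convolutions, and normalization used in the actual P2T are omitted.

```python
import numpy as np

def avg_pool(x, out_h, out_w):
    """Adaptive average pooling of a (H, W, C) feature map to (out_h, out_w, C)."""
    H, W, C = x.shape
    out = np.zeros((out_h, out_w, C))
    for i in range(out_h):
        for j in range(out_w):
            hs, he = i * H // out_h, (i + 1) * H // out_h
            ws, we = j * W // out_w, (j + 1) * W // out_w
            out[i, j] = x[hs:he, ws:we].mean(axis=(0, 1))
    return out

def pyramid_pool_tokens(x, pool_sizes=(1, 2, 3, 6)):
    """Flatten and concatenate multi-scale pooled maps into a short token sequence.

    Pool sizes are illustrative (PSPNet-style); with (1, 2, 3, 6) the result has
    1 + 4 + 9 + 36 = 50 tokens regardless of the input resolution.
    """
    return np.concatenate(
        [avg_pool(x, s, s).reshape(s * s, -1) for s in pool_sizes], axis=0
    )

def pooled_attention(x, pool_sizes=(1, 2, 3, 6)):
    """Single-head attention where only keys/values are pyramid-pooled."""
    H, W, C = x.shape
    q = x.reshape(H * W, C)                    # queries: full N = H*W tokens
    kv = pyramid_pool_tokens(x, pool_sizes)    # keys/values: M tokens, M << N
    attn = q @ kv.T / np.sqrt(C)               # (N, M) instead of (N, N)
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ kv                           # (N, C) output, O(N*M) cost

# For a 56x56 feature map, attention compares 3136 queries against only
# 50 pooled tokens instead of all 3136 tokens.
x = np.random.rand(56, 56, 64)
out = pooled_attention(x)
print(out.shape)  # (3136, 64)
```

Note that because the pooled tokens aggregate regions at several scales, the shortened key/value sequence also acts as a multi-scale context summary, which is the "powerful contextual features" aspect the abstract refers to.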