VOLO: Vision Outlooker for Visual Recognition

Author Information

Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2023 May;45(5):6575-6586. doi: 10.1109/TPAMI.2022.3206108. Epub 2023 Apr 3.

DOI: 10.1109/TPAMI.2022.3206108
PMID: 36094970
Abstract

Recently, Vision Transformers (ViTs) have been broadly explored in visual recognition. Because they encode fine-level features inefficiently, ViTs still perform worse than state-of-the-art CNNs when trained from scratch on a midsize dataset such as ImageNet. Through experimental analysis, we find two reasons for this: 1) the simple tokenization of input images fails to model important local structure such as edges and lines, leading to low training sample efficiency; 2) the redundant attention backbone design of ViTs leads to limited feature richness under fixed computation budgets and limited training samples. To overcome these limitations, we present a new simple and generic architecture, termed Vision Outlooker (VOLO), which implements a novel outlook attention operation that dynamically conducts local feature aggregation in a sliding-window manner across the input image. Unlike self-attention, which focuses on modeling global dependencies of local features at a coarse level, our outlook attention targets encoding finer-level features, which are critical for recognition but ignored by self-attention. Outlook attention breaks the bottleneck of self-attention, whose computation cost scales quadratically with the input spatial dimension, and is thus much more memory efficient. Compared to our Tokens-To-Token Vision Transformer (T2T-ViT), VOLO encodes the fine-level features essential for high-performance visual recognition more efficiently. Experiments show that with only 26.6M learnable parameters, VOLO achieves 84.2% top-1 accuracy on ImageNet-1K without using extra training data, 2.7% better than T2T-ViT with a comparable number of parameters. When the model is scaled up to 296M parameters, its performance further improves to 87.1%, setting a new record for ImageNet-1K classification. In addition, we take the proposed VOLO models as pretrained backbones and report superior performance on downstream tasks such as semantic segmentation. Code is available at https://github.com/sail-sg/volo.
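
Since the abstract compresses how outlook attention works, a short sketch may help. The idea is that each center token directly predicts, through a single linear layer, the attention weights over its K x K neighborhood (no query-key dot products are computed), and the values in that window are aggregated with those weights; overlapping windows are then summed back onto the feature map. Below is a minimal PyTorch sketch modeled on the structure of the official implementation at https://github.com/sail-sg/volo; the module is simplified (no dropout, a single standalone block) and the dimensions in the usage line are illustrative, not the paper's exact configuration.

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class OutlookAttention(nn.Module):
    """Each spatial location predicts a (K*K) x (K*K) attention map directly
    from its center token and uses it to aggregate the values in the local
    K x K window. No pairwise query-key similarities are computed."""

    def __init__(self, dim, num_heads=8, kernel_size=3, stride=1):
        super().__init__()
        head_dim = dim // num_heads
        self.num_heads = num_heads
        self.kernel_size = kernel_size
        self.stride = stride
        self.scale = head_dim ** -0.5

        self.v = nn.Linear(dim, dim)                              # value projection
        self.attn = nn.Linear(dim, kernel_size ** 4 * num_heads)  # attention weights
        self.proj = nn.Linear(dim, dim)                           # output projection
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2, stride=stride)
        self.pool = nn.AvgPool2d(stride, stride, ceil_mode=True)

    def forward(self, x):                         # x: (B, H, W, C)
        B, H, W, C = x.shape
        h, w = math.ceil(H / self.stride), math.ceil(W / self.stride)
        K = self.kernel_size

        # Gather the K x K value window around every location.
        v = self.v(x).permute(0, 3, 1, 2)         # (B, C, H, W)
        v = self.unfold(v).reshape(
            B, self.num_heads, C // self.num_heads, K * K, h * w
        ).permute(0, 1, 4, 3, 2)                  # (B, heads, h*w, K*K, head_dim)

        # Predict each window's attention map from its center token alone.
        a = self.pool(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        a = self.attn(a).reshape(
            B, h * w, self.num_heads, K * K, K * K
        ).permute(0, 2, 1, 3, 4)                  # (B, heads, h*w, K*K, K*K)
        a = (a * self.scale).softmax(dim=-1)

        # Weighted aggregation, then fold overlapping windows back to (H, W).
        out = (a @ v).permute(0, 1, 4, 3, 2).reshape(B, C * K * K, h * w)
        out = F.fold(out, output_size=(H, W), kernel_size=K,
                     padding=K // 2, stride=self.stride)
        return self.proj(out.permute(0, 2, 3, 1))  # (B, H, W, C)


# Usage: dense fine-level aggregation over a 14x14 feature map.
tokens = torch.randn(2, 14, 14, 192)
print(OutlookAttention(dim=192, num_heads=6)(tokens).shape)  # torch.Size([2, 14, 14, 192])

Because the attention map comes from a linear projection of each token rather than from pairwise token similarity, compute and memory grow linearly with the number of spatial locations, which is the efficiency advantage over self-attention that the abstract refers to.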

Similar Articles

1. VOLO: Vision Outlooker for Visual Recognition.
IEEE Trans Pattern Anal Mach Intell. 2023 May;45(5):6575-6586. doi: 10.1109/TPAMI.2022.3206108. Epub 2023 Apr 3.
2. Token Selection is a Simple Booster for Vision Transformers.
IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12738-12746. doi: 10.1109/TPAMI.2022.3208922. Epub 2023 Oct 3.
3. Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition.
IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):1328-1334. doi: 10.1109/TPAMI.2022.3145427. Epub 2022 Dec 5.
4. UniFormer: Unifying Convolution and Self-Attention for Visual Recognition.
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12581-12600. doi: 10.1109/TPAMI.2023.3282631. Epub 2023 Sep 5.
5. DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation.
Neural Netw. 2024 Dec;180:106653. doi: 10.1016/j.neunet.2024.106653. Epub 2024 Aug 22.
6. DctViT: Discrete Cosine Transform meet vision transformers.
Neural Netw. 2024 Apr;172:106139. doi: 10.1016/j.neunet.2024.106139. Epub 2024 Jan 19.
7. Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks.
IEEE Trans Pattern Anal Mach Intell. 2023 Sep;45(9):10883-10897. doi: 10.1109/TPAMI.2023.3263826. Epub 2023 Aug 7.
8. PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention.
Sensors (Basel). 2023 Mar 25;23(7):3447. doi: 10.3390/s23073447.
9. GFNet: Global Filter Networks for Visual Recognition.
IEEE Trans Pattern Anal Mach Intell. 2023 Sep;45(9):10960-10973. doi: 10.1109/TPAMI.2023.3263824. Epub 2023 Aug 7.
10. Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8274-8283. doi: 10.1109/TPAMI.2024.3401450. Epub 2024 Nov 6.

Cited By

1. Enhanced YOLOv11 Framework for Accurate Multi-Fault Detection in UAV Photovoltaic Inspection.
Sensors (Basel). 2025 Aug 26;25(17):5311. doi: 10.3390/s25175311.
2. LA-EAD: Simple and Effective Methods for Improving Logical Anomaly Detection Capability.
Sensors (Basel). 2025 Aug 13;25(16):5016. doi: 10.3390/s25165016.
3. An exploration of the spatial and temporal factors influencing industrial park vitality using multi-source geospatial data.
Sci Rep. 2025 Aug 12;15(1):29584. doi: 10.1038/s41598-025-15294-0.
4. GrotUNet: a novel leaf segmentation method.
Front Plant Sci. 2025 Jul 10;16:1378958. doi: 10.3389/fpls.2025.1378958. eCollection 2025.
5. SODU2-NET: a novel deep learning-based approach for salient object detection utilizing U-NET.
PeerJ Comput Sci. 2025 May 19;11:e2623. doi: 10.7717/peerj-cs.2623. eCollection 2025.
6. COVID-19 recognition from chest X-ray images by combining deep learning with transfer learning.
Digit Health. 2025 Feb 13;11:20552076251319667. doi: 10.1177/20552076251319667. eCollection 2025 Jan-Dec.
7. Enhancing Food Image Recognition by Multi-Level Fusion and the Attention Mechanism.
Foods. 2025 Jan 31;14(3):461. doi: 10.3390/foods14030461.
8. Leveraging two-dimensional pre-trained vision transformers for three-dimensional model generation via masked autoencoders.
Sci Rep. 2025 Jan 25;15(1):3164. doi: 10.1038/s41598-025-87376-y.
9. FundusNet: A Deep-Learning Approach for Fast Diagnosis of Neurodegenerative and Eye Diseases Using Fundus Images.
Bioengineering (Basel). 2025 Jan 13;12(1):57. doi: 10.3390/bioengineering12010057.
10. A benchmark of deep learning approaches to predict lung cancer risk using national lung screening trial cohort.
Sci Rep. 2025 Jan 11;15(1):1736. doi: 10.1038/s41598-024-84193-7.