VOLO: Vision Outlooker for Visual Recognition

Author Information

Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2023 May;45(5):6575-6586. doi: 10.1109/TPAMI.2022.3206108. Epub 2023 Apr 3.

DOI: 10.1109/TPAMI.2022.3206108
PMID: 36094970
Abstract

Recently, Vision Transformers (ViTs) have been broadly explored in visual recognition. Because they encode fine-level features inefficiently, ViTs still perform worse than state-of-the-art CNNs when trained from scratch on a midsize dataset such as ImageNet. Through experimental analysis, we find two reasons for this: 1) the simple tokenization of input images fails to model important local structure such as edges and lines, leading to low training sample efficiency; 2) the redundant attention backbone design of ViTs leads to limited feature richness under fixed computation budgets and limited training samples. To overcome these limitations, we present a new simple and generic architecture, termed Vision Outlooker (VOLO), which implements a novel outlook attention operation that dynamically conducts local feature aggregation in a sliding-window manner across the input image. Unlike self-attention, which focuses on modeling global dependencies of local features at a coarse level, our outlook attention targets encoding finer-level features, which are critical for recognition but ignored by self-attention. Outlook attention breaks the bottleneck of self-attention, whose computation cost scales quadratically with the input spatial dimension, and is thus much more memory efficient. Compared to our Tokens-To-Token Vision Transformer (T2T-ViT), VOLO encodes the fine-level features essential for high-performance visual recognition more efficiently. Experiments show that with only 26.6M learnable parameters, VOLO achieves 84.2% top-1 accuracy on ImageNet-1K without using extra training data, 2.7% better than T2T-ViT with a comparable number of parameters. When the model is scaled up to 296M parameters, its performance further improves to 87.1%, setting a new record for ImageNet-1K classification. In addition, we take the proposed VOLO models as pretrained backbones and report superior performance on downstream tasks such as semantic segmentation. Code is available at https://github.com/sail-sg/volo.
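
Since the abstract compresses how outlook attention works, a short sketch may help. The idea is that each center token directly predicts, through a single linear layer, the attention weights over its K x K neighborhood (no query-key dot products are computed), and the values in that window are aggregated with those weights; overlapping windows are then summed back onto the feature map. Below is a minimal PyTorch sketch modeled on the structure of the official implementation at https://github.com/sail-sg/volo; the module is simplified (no dropout, a single standalone block) and the dimensions in the usage line are illustrative, not the paper's exact configuration.

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class OutlookAttention(nn.Module):
    """Each spatial location predicts a (K*K) x (K*K) attention map directly
    from its center token and uses it to aggregate the values in the local
    K x K window. No pairwise query-key similarities are computed."""

    def __init__(self, dim, num_heads=8, kernel_size=3, stride=1):
        super().__init__()
        head_dim = dim // num_heads
        self.num_heads = num_heads
        self.kernel_size = kernel_size
        self.stride = stride
        self.scale = head_dim ** -0.5

        self.v = nn.Linear(dim, dim)                              # value projection
        self.attn = nn.Linear(dim, kernel_size ** 4 * num_heads)  # attention weights
        self.proj = nn.Linear(dim, dim)                           # output projection
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2, stride=stride)
        self.pool = nn.AvgPool2d(stride, stride, ceil_mode=True)

    def forward(self, x):                         # x: (B, H, W, C)
        B, H, W, C = x.shape
        h, w = math.ceil(H / self.stride), math.ceil(W / self.stride)
        K = self.kernel_size

        # Gather the K x K value window around every location.
        v = self.v(x).permute(0, 3, 1, 2)         # (B, C, H, W)
        v = self.unfold(v).reshape(
            B, self.num_heads, C // self.num_heads, K * K, h * w
        ).permute(0, 1, 4, 3, 2)                  # (B, heads, h*w, K*K, head_dim)

        # Predict each window's attention map from its center token alone.
        a = self.pool(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        a = self.attn(a).reshape(
            B, h * w, self.num_heads, K * K, K * K
        ).permute(0, 2, 1, 3, 4)                  # (B, heads, h*w, K*K, K*K)
        a = (a * self.scale).softmax(dim=-1)

        # Weighted aggregation, then fold overlapping windows back to (H, W).
        out = (a @ v).permute(0, 1, 4, 3, 2).reshape(B, C * K * K, h * w)
        out = F.fold(out, output_size=(H, W), kernel_size=K,
                     padding=K // 2, stride=self.stride)
        return self.proj(out.permute(0, 2, 3, 1))  # (B, H, W, C)


# Usage: dense fine-level aggregation over a 14x14 feature map.
tokens = torch.randn(2, 14, 14, 192)
print(OutlookAttention(dim=192, num_heads=6)(tokens).shape)  # torch.Size([2, 14, 14, 192])

Because the attention map comes from a linear projection of each token rather than from pairwise token similarity, compute and memory grow linearly with the number of spatial locations, which is the efficiency advantage over self-attention that the abstract refers to.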

Similar Articles

1. VOLO: Vision Outlooker for Visual Recognition.
IEEE Trans Pattern Anal Mach Intell. 2023 May;45(5):6575-6586. doi: 10.1109/TPAMI.2022.3206108. Epub 2023 Apr 3.
2. Token Selection is a Simple Booster for Vision Transformers.
IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12738-12746. doi: 10.1109/TPAMI.2022.3208922. Epub 2023 Oct 3.
3. Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition.
IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):1328-1334. doi: 10.1109/TPAMI.2022.3145427. Epub 2022 Dec 5.
4. UniFormer: Unifying Convolution and Self-Attention for Visual Recognition.
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12581-12600. doi: 10.1109/TPAMI.2023.3282631. Epub 2023 Sep 5.
5. DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation.
Neural Netw. 2024 Dec;180:106653. doi: 10.1016/j.neunet.2024.106653. Epub 2024 Aug 22.
6. DctViT: Discrete Cosine Transform meet vision transformers.
Neural Netw. 2024 Apr;172:106139. doi: 10.1016/j.neunet.2024.106139. Epub 2024 Jan 19.
7. Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks.
IEEE Trans Pattern Anal Mach Intell. 2023 Sep;45(9):10883-10897. doi: 10.1109/TPAMI.2023.3263826. Epub 2023 Aug 7.
8. PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention.
Sensors (Basel). 2023 Mar 25;23(7):3447. doi: 10.3390/s23073447.
9. GFNet: Global Filter Networks for Visual Recognition.
IEEE Trans Pattern Anal Mach Intell. 2023 Sep;45(9):10960-10973. doi: 10.1109/TPAMI.2023.3263824. Epub 2023 Aug 7.
10. Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8274-8283. doi: 10.1109/TPAMI.2024.3401450. Epub 2024 Nov 6.

Cited By

1. Enhanced YOLOv11 Framework for Accurate Multi-Fault Detection in UAV Photovoltaic Inspection.
Sensors (Basel). 2025 Aug 26;25(17):5311. doi: 10.3390/s25175311.
2. LA-EAD: Simple and Effective Methods for Improving Logical Anomaly Detection Capability.
Sensors (Basel). 2025 Aug 13;25(16):5016. doi: 10.3390/s25165016.
3. An exploration of the spatial and temporal factors influencing industrial park vitality using multi-source geospatial data.
Sci Rep. 2025 Aug 12;15(1):29584. doi: 10.1038/s41598-025-15294-0.
4. GrotUNet: a novel leaf segmentation method.
Front Plant Sci. 2025 Jul 10;16:1378958. doi: 10.3389/fpls.2025.1378958. eCollection 2025.
5. SODU2-NET: a novel deep learning-based approach for salient object detection utilizing U-NET.
PeerJ Comput Sci. 2025 May 19;11:e2623. doi: 10.7717/peerj-cs.2623. eCollection 2025.
6. COVID-19 recognition from chest X-ray images by combining deep learning with transfer learning.
Digit Health. 2025 Feb 13;11:20552076251319667. doi: 10.1177/20552076251319667. eCollection 2025 Jan-Dec.
7. Enhancing Food Image Recognition by Multi-Level Fusion and the Attention Mechanism.
Foods. 2025 Jan 31;14(3):461. doi: 10.3390/foods14030461.
8. Leveraging two-dimensional pre-trained vision transformers for three-dimensional model generation via masked autoencoders.
Sci Rep. 2025 Jan 25;15(1):3164. doi: 10.1038/s41598-025-87376-y.
9. FundusNet: A Deep-Learning Approach for Fast Diagnosis of Neurodegenerative and Eye Diseases Using Fundus Images.
Bioengineering (Basel). 2025 Jan 13;12(1):57. doi: 10.3390/bioengineering12010057.
10. A benchmark of deep learning approaches to predict lung cancer risk using national lung screening trial cohort.
Sci Rep. 2025 Jan 11;15(1):1736. doi: 10.1038/s41598-024-84193-7.