Contextual Transformer Networks for Visual Recognition.

Authors

Li Yehao, Yao Ting, Pan Yingwei, Mei Tao

Publication

IEEE Trans Pattern Anal Mach Intell. 2023 Feb;45(2):1489-1500. doi: 10.1109/TPAMI.2022.3164083. Epub 2023 Jan 6.

Abstract

Transformers with self-attention have revolutionized the field of natural language processing and have recently inspired Transformer-style architecture designs that achieve competitive results on numerous computer vision tasks. Nevertheless, most existing designs directly employ self-attention over a 2D feature map to obtain the attention matrix from pairs of isolated queries and keys at each spatial location, leaving the rich contexts among neighboring keys under-exploited. In this work, we design a novel Transformer-style module, i.e., the Contextual Transformer (CoT) block, for visual recognition. This design fully capitalizes on the contextual information among input keys to guide the learning of a dynamic attention matrix and thus strengthens the capacity of visual representation. Technically, the CoT block first contextually encodes the input keys via a 3×3 convolution, leading to a static contextual representation of the inputs. We further concatenate the encoded keys with the input queries to learn the dynamic multi-head attention matrix through two consecutive 1×1 convolutions. The learnt attention matrix is multiplied by the input values to achieve the dynamic contextual representation of the inputs. The fusion of the static and dynamic contextual representations is finally taken as the output. Our CoT block is appealing in that it can readily replace each 3×3 convolution in ResNet architectures, yielding a Transformer-style backbone named Contextual Transformer Networks (CoTNet). Through extensive experiments over a wide range of applications (e.g., image recognition, object detection, instance segmentation, and semantic segmentation), we validate the superiority of CoTNet as a stronger backbone. Source code is available at https://github.com/JDAI-CV/CoTNet.
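The abstract describes the CoT block's computation concretely enough to sketch: a 3×3 convolution encodes the keys into a static context, two consecutive 1×1 convolutions over the concatenated keys and queries produce a multi-head attention matrix over local windows, the attention weights the values into a dynamic context, and the two contexts are fused. Below is a minimal, unofficial PyTorch sketch of that flow. The class and parameter names (`CoTBlock`, `heads`, `reduction`), the grouping of the 3×3 convolution, the per-window softmax, and the simple additive fusion at the end are illustrative assumptions rather than the authors' exact design; the official implementation is at https://github.com/JDAI-CV/CoTNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoTBlock(nn.Module):
    """Sketch of a Contextual Transformer (CoT) block.

    Static context : keys encoded by a grouped 3x3 convolution.
    Dynamic context: local multi-head attention learned from the concatenated
                     [encoded keys, queries] via two consecutive 1x1
                     convolutions, applied to the values over each k x k window.
    Output         : fusion (here a plain sum) of both contexts.
    """

    def __init__(self, dim: int, kernel_size: int = 3, heads: int = 4, reduction: int = 2):
        super().__init__()
        self.kernel_size = kernel_size
        self.heads = heads  # dim must be divisible by heads (and by 4 for the grouped conv)
        # Static contextual encoding of the keys (3x3 grouped convolution).
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=4, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        # Value embedding (1x1 convolution).
        self.value_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )
        # Two consecutive 1x1 convolutions: [static keys, queries] -> per-head
        # attention logits over each k x k local window.
        attn_dim = 2 * dim // reduction
        self.attn = nn.Sequential(
            nn.Conv2d(2 * dim, attn_dim, 1, bias=False),
            nn.BatchNorm2d(attn_dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(attn_dim, heads * kernel_size * kernel_size, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        k = self.kernel_size
        k_static = self.key_embed(x)                        # static context, (B, C, H, W)
        v = self.value_embed(x)                             # values, (B, C, H, W)
        # Attention logits from the concatenated static keys and queries (the input itself).
        attn = self.attn(torch.cat([k_static, x], dim=1))   # (B, heads*k*k, H, W)
        attn = attn.view(B, self.heads, k * k, H * W)
        attn = F.softmax(attn, dim=2)                       # normalize over the local window
        # Gather each value's k x k neighborhood and aggregate it with the attention.
        v_unf = F.unfold(v, kernel_size=k, padding=k // 2)  # (B, C*k*k, H*W)
        v_unf = v_unf.view(B, self.heads, C // self.heads, k * k, H * W)
        k_dynamic = (attn.unsqueeze(2) * v_unf).sum(dim=3)  # (B, heads, C/heads, H*W)
        k_dynamic = k_dynamic.reshape(B, C, H, W)           # dynamic context
        return k_static + k_dynamic                         # fuse static + dynamic contexts


if __name__ == "__main__":
    block = CoTBlock(dim=64)
    y = block(torch.randn(2, 64, 32, 32))
    print(y.shape)  # torch.Size([2, 64, 32, 32])
```

Because the block preserves the channel count and spatial size, it can stand in for a 3×3 convolution inside a ResNet bottleneck, which is how the abstract describes assembling CoTNet; the additive fusion of the two contexts above is a simplification of the paper's fusion step.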
