Contextual Transformer Networks for Visual Recognition.

Authors

Li Yehao, Yao Ting, Pan Yingwei, Mei Tao

Publication

IEEE Trans Pattern Anal Mach Intell. 2023 Feb;45(2):1489-1500. doi: 10.1109/TPAMI.2022.3164083. Epub 2023 Jan 6.

Abstract

Transformers with self-attention have revolutionized the field of natural language processing and have recently inspired Transformer-style architecture designs that achieve competitive results on numerous computer vision tasks. Nevertheless, most existing designs directly employ self-attention over a 2D feature map to obtain the attention matrix from pairs of isolated queries and keys at each spatial location, leaving the rich contexts among neighboring keys under-exploited. In this work, we design a novel Transformer-style module, i.e., the Contextual Transformer (CoT) block, for visual recognition. This design fully capitalizes on the contextual information among input keys to guide the learning of a dynamic attention matrix and thus strengthens the capacity of visual representation. Technically, the CoT block first contextually encodes the input keys via a 3×3 convolution, leading to a static contextual representation of the inputs. We further concatenate the encoded keys with the input queries to learn the dynamic multi-head attention matrix through two consecutive 1×1 convolutions. The learnt attention matrix is multiplied by the input values to achieve the dynamic contextual representation of the inputs. The fusion of the static and dynamic contextual representations is finally taken as the output. Our CoT block is appealing in that it can readily replace each 3×3 convolution in ResNet architectures, yielding a Transformer-style backbone named Contextual Transformer Networks (CoTNet). Through extensive experiments over a wide range of applications (e.g., image recognition, object detection, instance segmentation, and semantic segmentation), we validate the superiority of CoTNet as a stronger backbone. Source code is available at https://github.com/JDAI-CV/CoTNet.
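The abstract describes the CoT block's computation concretely enough to sketch: a 3×3 convolution encodes the keys into a static context, two consecutive 1×1 convolutions over the concatenated keys and queries produce a multi-head attention matrix over local windows, the attention weights the values into a dynamic context, and the two contexts are fused. Below is a minimal, unofficial PyTorch sketch of that flow. The class and parameter names (`CoTBlock`, `heads`, `reduction`), the grouping of the 3×3 convolution, the per-window softmax, and the simple additive fusion at the end are illustrative assumptions rather than the authors' exact design; the official implementation is at https://github.com/JDAI-CV/CoTNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoTBlock(nn.Module):
    """Sketch of a Contextual Transformer (CoT) block.

    Static context : keys encoded by a grouped 3x3 convolution.
    Dynamic context: local multi-head attention learned from the concatenated
                     [encoded keys, queries] via two consecutive 1x1
                     convolutions, applied to the values over each k x k window.
    Output         : fusion (here a plain sum) of both contexts.
    """

    def __init__(self, dim: int, kernel_size: int = 3, heads: int = 4, reduction: int = 2):
        super().__init__()
        self.kernel_size = kernel_size
        self.heads = heads  # dim must be divisible by heads (and by 4 for the grouped conv)
        # Static contextual encoding of the keys (3x3 grouped convolution).
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=4, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        # Value embedding (1x1 convolution).
        self.value_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )
        # Two consecutive 1x1 convolutions: [static keys, queries] -> per-head
        # attention logits over each k x k local window.
        attn_dim = 2 * dim // reduction
        self.attn = nn.Sequential(
            nn.Conv2d(2 * dim, attn_dim, 1, bias=False),
            nn.BatchNorm2d(attn_dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(attn_dim, heads * kernel_size * kernel_size, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        k = self.kernel_size
        k_static = self.key_embed(x)                        # static context, (B, C, H, W)
        v = self.value_embed(x)                             # values, (B, C, H, W)
        # Attention logits from the concatenated static keys and queries (the input itself).
        attn = self.attn(torch.cat([k_static, x], dim=1))   # (B, heads*k*k, H, W)
        attn = attn.view(B, self.heads, k * k, H * W)
        attn = F.softmax(attn, dim=2)                       # normalize over the local window
        # Gather each value's k x k neighborhood and aggregate it with the attention.
        v_unf = F.unfold(v, kernel_size=k, padding=k // 2)  # (B, C*k*k, H*W)
        v_unf = v_unf.view(B, self.heads, C // self.heads, k * k, H * W)
        k_dynamic = (attn.unsqueeze(2) * v_unf).sum(dim=3)  # (B, heads, C/heads, H*W)
        k_dynamic = k_dynamic.reshape(B, C, H, W)           # dynamic context
        return k_static + k_dynamic                         # fuse static + dynamic contexts


if __name__ == "__main__":
    block = CoTBlock(dim=64)
    y = block(torch.randn(2, 64, 32, 32))
    print(y.shape)  # torch.Size([2, 64, 32, 32])
```

Because the block preserves the channel count and spatial size, it can stand in for a 3×3 convolution inside a ResNet bottleneck, which is how the abstract describes assembling CoTNet; the additive fusion of the two contexts above is a simplification of the paper's fusion step.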
