Dual Vision Transformer

Authors

Ting Yao, Yehao Li, Yingwei Pan, Yu Wang, Xiao-Ping Zhang, Tao Mei

Publication

IEEE Trans Pattern Anal Mach Intell. 2023 Sep;45(9):10870-10882. doi: 10.1109/TPAMI.2023.3268446. Epub 2023 Aug 7.

Abstract

Recent advances have presented several strategies to mitigate the computation of the self-attention mechanism on high-resolution inputs. Many of these works decompose the global self-attention procedure over image patches into regional and local feature extraction procedures, each incurring a smaller computational cost. Despite their efficiency, these approaches seldom explore the holistic interactions among all patches and thus struggle to fully capture global semantics. In this paper, we propose a novel Transformer architecture that elegantly exploits global semantics for self-attention learning, namely the Dual Vision Transformer (Dual-ViT). The new architecture incorporates a critical semantic pathway that efficiently compresses token vectors into global semantics at a reduced order of complexity. These compressed global semantics then serve as prior information for learning finer, pixel-level local details through a second, pixel pathway. The semantic and pixel pathways are integrated and jointly trained, spreading the enhanced self-attention information through both pathways in parallel. Dual-ViT is thereby able to capitalize on global semantics to boost self-attention learning without much additional computational cost. We empirically demonstrate that Dual-ViT achieves superior accuracy over state-of-the-art Transformer architectures at comparable training complexity.
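
The abstract only sketches the two pathways at a high level. Below is a minimal, hypothetical PyTorch sketch of the dual-pathway idea as described: the semantic pathway compresses the N pixel tokens into a small set of M semantic tokens and runs full self-attention among them (cost O(M^2) rather than O(N^2)), while the pixel pathway lets the N pixel tokens attend to this compressed global prior via cross-attention (cost O(N*M)). The average-pooling compressor, residual wiring, and layer layout here are illustrative assumptions, not the paper's actual Dual-ViT block.

    # Minimal sketch of the dual-pathway idea from the abstract.
    # NOT the authors' implementation: pooling compressor, layer
    # layout, and residual wiring are assumptions for illustration.
    import torch
    import torch.nn as nn

    class DualPathwayBlock(nn.Module):
        def __init__(self, dim, num_heads=8, pool_size=7):
            super().__init__()
            # Semantic pathway: compress N pixel tokens into M << N
            # semantic tokens (M = pool_size**2), then run full
            # self-attention among them -- O(M^2) instead of O(N^2).
            self.pool = nn.AdaptiveAvgPool2d(pool_size)
            self.sem_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Pixel pathway: pixel tokens attend to the compressed
            # global semantics as prior information -- O(N*M).
            self.pix_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_sem = nn.LayerNorm(dim)
            self.norm_pix = nn.LayerNorm(dim)

        def forward(self, x, hw):
            # x: (B, N, C) pixel tokens; hw: (H, W) with H * W == N
            B, N, C = x.shape
            H, W = hw
            # --- semantic pathway: compress, then global self-attention ---
            grid = x.transpose(1, 2).reshape(B, C, H, W)
            sem = self.pool(grid).flatten(2).transpose(1, 2)   # (B, M, C)
            sem = self.norm_sem(sem)
            sem, _ = self.sem_attn(sem, sem, sem)              # interaction among M tokens
            # --- pixel pathway: pixels query the global prior ---
            q = self.norm_pix(x)
            out, _ = self.pix_attn(q, sem, sem)                # cross-attention, O(N*M)
            return x + out                                     # residual connection

    # Usage: a 56x56 feature map of 96-dim tokens.
    blk = DualPathwayBlock(dim=96, num_heads=8)
    tokens = torch.randn(2, 56 * 56, 96)
    y = blk(tokens, (56, 56))
    print(y.shape)  # torch.Size([2, 3136, 96])

With pool_size=7, each pixel token attends to only 49 semantic tokens instead of all 3136 pixel tokens, which is how the compressed global semantics keep the complexity low while still injecting holistic information into the pixel pathway.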
