DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation.

Affiliations

Key Laboratory of Smart Human-Computer Interaction and Wearable Technology of Shaanxi Province, Xidian University, Xi'an, 710071, China.

Publication

Neural Netw. 2024 Dec;180:106653. doi: 10.1016/j.neunet.2024.106653. Epub 2024 Aug 22.

Abstract

Recently, Vision Transformers and their variants have demonstrated remarkable performance on various computer vision tasks, thanks to their competence in capturing global visual dependencies through self-attention. However, global self-attention incurs quadratic computational overhead, which becomes prohibitive for high-resolution vision tasks (e.g., object detection and semantic segmentation). Many recent works have attempted to reduce this cost by applying fine-grained local attention, but these approaches cripple the long-range modeling power of the original self-attention mechanism. Furthermore, these approaches usually have similar receptive fields within each layer, limiting the ability of each self-attention layer to capture multi-scale features and degrading performance on images containing objects of different scales. To address these issues, we develop the Diagonal-shaped Window (DiagSWin) attention mechanism, which models attention over diagonal regions at hybrid scales within each attention layer. The key idea of DiagSWin attention is to inject multi-scale receptive field sizes into tokens: before computing the self-attention matrix, each token attends to its closest surrounding tokens at fine granularity and to far-away tokens at coarse granularity. This mechanism effectively captures multi-scale context information while reducing computational complexity. With DiagSWin attention, we present a new variant of Vision Transformer models, called DiagSWin Transformers, and demonstrate their superiority in extensive experiments across various tasks. Specifically, the large DiagSWin Transformer achieves 84.4% Top-1 accuracy on ImageNet and outperforms the SOTA CSWin Transformer with 40% less model size and computation cost. When employed as backbones, DiagSWin Transformers achieve significant improvements over current SOTA models. In addition, our DiagSWin-Base model yields 51.1 box mAP and 45.8 mask mAP on COCO for object detection and segmentation, and 52.3 mIoU on ADE20K for semantic segmentation.
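
The core idea described above, each token attending to nearby tokens at fine granularity and to distant tokens at a coarse, pooled granularity, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain square windows rather than the paper's diagonal-shaped windows, identity projections instead of learned query/key/value projections, and an arbitrary pooling rate; the function name mixed_granularity_attention and all hyperparameters are assumptions made for illustration only.

```python
# Illustrative sketch (not the authors' code): keys inside the local window are
# kept at fine resolution, while the rest of the feature map is average-pooled
# to coarse tokens before attention -- the "fine near, coarse far" idea from the
# abstract. Assumes a square token grid whose side is divisible by `window`.
import torch
import torch.nn.functional as F


def mixed_granularity_attention(x, window=4, pool=4):
    """x: (B, N, C) tokens on a square H x W grid (N = H * W)."""
    B, N, C = x.shape
    H = W = int(N ** 0.5)
    scale = C ** -0.5

    # Coarse tokens: average-pool the whole grid so far-away regions are
    # summarized at low resolution.
    grid = x.transpose(1, 2).reshape(B, C, H, W)
    coarse = F.avg_pool2d(grid, pool).flatten(2).transpose(1, 2)  # (B, (H/pool)*(W/pool), C)

    out = torch.zeros_like(x)
    for wy in range(0, H, window):
        for wx in range(0, W, window):
            # Flat indices of the fine tokens inside this local window.
            ys = torch.arange(wy, wy + window)
            xs = torch.arange(wx, wx + window)
            idx = (ys[:, None] * W + xs[None, :]).reshape(-1)

            q = x[:, idx]                              # queries of this window
            kv = torch.cat([x[:, idx], coarse], dim=1)  # fine-near + coarse-far keys/values
            attn = (q @ kv.transpose(1, 2)) * scale
            attn = attn.softmax(dim=-1)
            out[:, idx] = attn @ kv
    return out


if __name__ == "__main__":
    tokens = torch.randn(2, 16 * 16, 64)       # 16 x 16 grid of 64-d tokens
    print(mixed_granularity_attention(tokens).shape)  # torch.Size([2, 256, 64])
```

Because each window attends to window*window fine tokens plus a fixed number of coarse tokens, the cost grows roughly linearly with the number of tokens rather than quadratically, which is the complexity benefit the abstract claims for multi-scale windowed attention in general.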
