Key Laboratory of Smart Human-Computer Interaction and Wearable Technology of Shaanxi Province, Xidian University, Xi'an, 710071, China.
Neural Netw. 2024 Dec;180:106653. doi: 10.1016/j.neunet.2024.106653. Epub 2024 Aug 22.
Recently, Vision Transformers and their variants have demonstrated remarkable performance on various computer vision tasks, thanks to their ability to capture global visual dependencies through self-attention. However, global self-attention incurs high computational cost due to its quadratic complexity, especially for high-resolution vision tasks (e.g., object detection and semantic segmentation). Many recent works have attempted to reduce this cost by applying fine-grained local attention, but these approaches cripple the long-range modeling power of the original self-attention mechanism. Furthermore, these approaches usually have similar receptive fields within each layer, which limits the ability of each self-attention layer to capture multi-scale features and degrades performance when handling images with objects of different scales. To address these issues, we develop the Diagonal-shaped Window (DiagSWin) attention mechanism, which models attention over diagonal regions at hybrid scales within each attention layer. The key idea of DiagSWin attention is to inject multi-scale receptive field sizes into tokens: before computing the self-attention matrix, each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity. This mechanism effectively captures multi-scale context information while reducing computational complexity. With DiagSWin attention, we present a new variant of Vision Transformer models, called DiagSWin Transformers, and demonstrate their superiority in extensive experiments across various tasks. Specifically, the large-sized DiagSWin Transformer achieves 84.4% Top-1 accuracy on ImageNet and outperforms the SOTA CSWin Transformer with 40% smaller model size and computation cost. When employed as backbones, DiagSWin Transformers achieve significant improvements over current SOTA models. In addition, our DiagSWin-Base model yields 51.1 box mAP and 45.8 mask mAP on COCO for object detection and instance segmentation, and 52.3 mIoU on ADE20K for semantic segmentation.
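To make the hybrid-scale idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: the abstract does not specify the diagonal-window partitioning, so the square local windows, the average-pooling of distant context, and the names HybridScaleAttention, local_window, and coarse_pool are illustrative assumptions. Each query attends to fine-grained keys/values inside its local window together with coarse-grained keys/values obtained by pooling the whole feature map, so nearby tokens are seen at fine granularity and distant tokens at coarse granularity while the attention matrix stays small.

```python
# Minimal sketch of hybrid-scale attention, assuming square local windows and
# average-pooled coarse tokens; NOT the paper's diagonal-window implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridScaleAttention(nn.Module):
    """Each query attends to fine local tokens plus coarse pooled global tokens."""

    def __init__(self, dim, num_heads=4, local_window=7, coarse_pool=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.local_window = local_window  # fine-grained neighborhood size (assumed)
        self.coarse_pool = coarse_pool    # pooling ratio for distant context (assumed)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C); H and W are assumed divisible by local_window.
        B, H, W, C = x.shape
        w = self.local_window
        nW = (H // w) * (W // w)
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Coarse branch: average-pool keys/values so far-away context is
        # summarized by a small set of tokens shared by every window.
        k_c = F.avg_pool2d(k.permute(0, 3, 1, 2), self.coarse_pool).flatten(2).transpose(1, 2)
        v_c = F.avg_pool2d(v.permute(0, 3, 1, 2), self.coarse_pool).flatten(2).transpose(1, 2)
        k_c = k_c.unsqueeze(1).expand(-1, nW, -1, -1)  # (B, nW, Nc, C)
        v_c = v_c.unsqueeze(1).expand(-1, nW, -1, -1)

        def windows(t):
            # (B, H, W, C) -> (B, nW, w*w, C): non-overlapping square windows.
            t = t.reshape(B, H // w, w, W // w, w, C)
            return t.permute(0, 1, 3, 2, 4, 5).reshape(B, nW, w * w, C)

        def heads(t):
            # (B, nW, N, C) -> (B, nW, num_heads, N, head_dim)
            return t.reshape(B, nW, t.shape[2], self.num_heads, self.head_dim).transpose(2, 3)

        q_w = heads(windows(q))
        k_all = heads(torch.cat([windows(k), k_c], dim=2))  # fine local + coarse global keys
        v_all = heads(torch.cat([windows(v), v_c], dim=2))

        attn = (q_w @ k_all.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v_all).transpose(2, 3).reshape(B, nW, w * w, C)
        # Merge windows back to the (B, H, W, C) token-map layout.
        out = out.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 14, 14, 64)      # 14x14 token map with 64 channels
    y = HybridScaleAttention(dim=64)(x)
    print(y.shape)                      # torch.Size([2, 14, 14, 64])
```

Compared with full global attention over N tokens (O(N^2)), each query in this sketch scores only w^2 local tokens plus a handful of pooled tokens, which illustrates the complexity reduction the abstract refers to; DiagSWin's actual diagonal-shaped windows would replace the square window partition assumed here.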