Li Xiangyang, Li Yafeng, Fan Pan, Zhang Xueya
School of Computer, Baoji University of Arts and Science, Baoji 721016, China.
Sensors (Basel). 2025 Aug 14;25(16):5046. doi: 10.3390/s25165046.
The critical component of the vision transformer (ViT) architecture is multi-head self-attention (MSA), which enables the encoding of long-range dependencies and heterogeneous interactions. However, MSA has two significant limitations: a limited ability to capture local features and high computational cost. To address these challenges, this paper proposes an integrated multi-head self-attention approach with a bottleneck enhancement structure, named WMSA-WBS, which mitigates both shortcomings of conventional MSA. Unlike existing wavelet-enhanced ViT variants, which mainly apply wavelet decomposition in isolation within the attention layer, WMSA-WBS introduces a co-design of wavelet-based frequency processing and bottleneck optimization, achieving more efficient and comprehensive feature learning. Within WMSA-WBS, the proposed wavelet multi-head self-attention (WMSA) approach is combined with a novel wavelet bottleneck structure to capture both global and local information across the spatial, frequency, and channel domains. Notably, the module achieves these capabilities while maintaining low computational complexity and memory consumption. Extensive experiments demonstrate that ViT models equipped with WMSA-WBS achieve superior trade-offs between accuracy and model complexity across various vision tasks, including image classification, object detection, and semantic segmentation.
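To make the core idea concrete, the following is a minimal NumPy sketch of how wavelet decomposition can reduce attention cost: a single-level Haar DWT splits a feature map into one low-frequency and three high-frequency bands, and attention is then computed on the downsampled low-frequency tokens, shrinking the token count by a factor of four. This is an illustrative toy, not the authors' WMSA-WBS implementation; the function names (`haar_dwt2`, `wavelet_attention`) and the single-head, bias-free projections are assumptions made for brevity.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2-D Haar DWT on an (H, W, C) feature map.
    Returns the low-frequency band LL and the high-frequency bands
    (LH, HL, HH), each of shape (H/2, W/2, C). H and W must be even."""
    a = x[0::2, 0::2]  # top-left of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # local average: low-frequency content
    lh = (a - b + c - d) / 2.0  # horizontal detail
    hl = (a + b - c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, (lh, hl, hh)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def wavelet_attention(x, w_q, w_k, w_v):
    """Toy single-head self-attention computed on the LL band only.
    Tokens come from the downsampled LL band, so the attention matrix
    is over N/4 tokens instead of N, cutting its cost by ~16x."""
    ll, _ = haar_dwt2(x)
    h, w, c = ll.shape
    tokens = ll.reshape(h * w, c)                    # (N/4, C) tokens
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N/4, N/4)
    return attn @ v
```

Because the Haar transform above is orthonormal, no signal energy is lost in the decomposition; the high-frequency bands remain available for the local-feature and bottleneck pathways that the full WMSA-WBS design describes.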