Han De-Wei, Yin Xiao-Lei, Xu Jian, Li Kang, Li Jun-Jie, Wang Lu, Ma Zhao-Yuan
School of System Design and Intelligent Manufacturing, Southern University of Science and Technology, 1088 Xueyuan Boulevard, Nanshan District, Shenzhen, 518055, China.
The Future Laboratory, Tsinghua University, 160 Chengfu Road, Haidian District, 100084, Beijing, China.
Sci Rep. 2024 Jul 31;14(1):17719. doi: 10.1038/s41598-024-68587-1.
Swin Transformer is an important work among the attempts to reduce the computational complexity of Transformers while maintaining their excellent performance in computer vision. Window-based patch self-attention can exploit the local connectivity of image features, and shifted window-based patch self-attention enables communication of information between different patches across the entire image. Through an in-depth study of how different shifted window sizes affect the efficiency of patch information communication, this article proposes a Dual-Scale Transformer with a double-sized shifted window attention method. The proposed method surpasses CNN-based methods such as U-Net, AttenU-Net, ResU-Net, and CE-Net by a considerable margin (approximately a 3%-6% increase) and outperforms the Transformer-based single-scale Swin Transformer (SwinT) (approximately a 1% increase) on the Kvasir-SEG, ISIC2017, MICCAI EndoVisSub-Instrument, and CadVesSet datasets. The experimental results verify that the proposed dual-scale shifted window attention benefits the communication of patch information and can enhance segmentation results to the state of the art. We also conduct an ablation study on the effect of the shifted window size on information flow efficiency and verify that dual-scale shifted window attention is the optimal network design. Our study highlights the significant impact of network structure design on visual performance, providing valuable insights for the design of networks based on Transformer architectures.
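To make the mechanism described in the abstract concrete, the following is a minimal PyTorch-style sketch of the dual-scale shifted window attention idea: two parallel window self-attention branches with different window sizes, each applying a cyclic shift before attention, with the branch outputs fused by averaging. This is an illustrative reconstruction under stated assumptions, not the authors' implementation; all module and parameter names (e.g. DualScaleShiftedWindowAttention, window_sizes=(7, 14)) are assumptions made for the sketch.

```python
# Illustrative sketch of dual-scale shifted window attention (not the paper's code).
import torch
import torch.nn as nn


def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into (num_windows*B, ws*ws, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


def window_reverse(windows, ws, H, W):
    """Inverse of window_partition, back to a (B, H, W, C) feature map."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class DualScaleShiftedWindowAttention(nn.Module):
    """Window self-attention at two window sizes, each with its own cyclic shift."""

    def __init__(self, dim, num_heads, window_sizes=(7, 14)):
        super().__init__()
        self.window_sizes = window_sizes
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in window_sizes
        )

    def forward(self, x):  # x: (B, H, W, C); H and W divisible by both window sizes
        B, H, W, C = x.shape
        out = 0
        for ws, attn in zip(self.window_sizes, self.attns):
            shift = ws // 2
            # Cyclic shift so neighbouring windows exchange information.
            shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
            win = window_partition(shifted, ws)       # (nW*B, ws*ws, C)
            win, _ = attn(win, win, win)              # self-attention within each window
            shifted = window_reverse(win, ws, H, W)
            out = out + torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))
        return out / len(self.window_sizes)


# Example: a 28x28 feature map with 96 channels, window sizes 7 and 14.
feat = torch.randn(2, 28, 28, 96)
block = DualScaleShiftedWindowAttention(dim=96, num_heads=4)
print(block(feat).shape)  # torch.Size([2, 28, 28, 96])
```

The two window sizes give each token both a fine-grained and a coarser-grained communication path in the same block, which is the intuition behind the dual-scale design; the fusion by averaging is one plausible choice, not necessarily the one used in the paper.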