Li Xiang, Fu Chong, Wang Qun, Zhang Wenchao, Ye Chen, Chen Junxin, Sham Chiu-Wing
IEEE J Biomed Health Inform. 2025 Sep;29(9):6754-6766. doi: 10.1109/JBHI.2025.3555805.
Transformers have recently gained significant attention in medical image segmentation due to their ability to capture long-range dependencies. However, the large background regions in medical images introduce distracting noise and increase the computational burden on the fine-grained self-attention (SA) mechanism, a key component of the transformer. At the same time, preserving fine-grained detail is essential for accurately segmenting complex, blurred medical images with diverse shapes and sizes. We therefore propose a novel Multi-scale Dynamic Sparse Attention (MDSA) module, which flexibly reduces computational cost while maintaining content-aware, multi-scale, fine-grained interactions. Specifically, multi-scale aggregation is first applied to the feature maps to enrich the diversity of interaction information. Then, for each query, irrelevant key-value pairs are filtered out at a coarse-grained level. Finally, fine-grained SA is performed on the remaining key-value pairs. In addition, we design an enhanced downsampling merging (EDM) module and an enhanced upsampling fusion (EUF) module for building pyramid architectures. Using MDSA to construct the basic blocks, combined with EDMs and EUFs, we develop a UNet-like model named MDSA-UNet. Since MDSA-UNet dynamically processes only a small subset of relevant fine-grained features, it achieves strong segmentation performance with high computational efficiency. Extensive experiments on four datasets spanning three different image types demonstrate that our MDSA-UNet, without pre-training, significantly outperforms other non-pretrained methods and even competes with pre-trained models, achieving Dice scores of 82.10% on DDTI, 80.20% on TN3K, 90.75% on ISIC2018, and 91.05% on ACDC. Meanwhile, our model maintains low complexity, with only 6.65 M parameters and 4.54 G FLOPs at a resolution of 224 × 224, ensuring both effectiveness and efficiency. Code is available at URL.
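The coarse-to-fine procedure the abstract describes (pool features into coarse descriptors, route each query to a few relevant regions, then run fine-grained self-attention only over the surviving key-value pairs) can be sketched as region-routing sparse attention. The NumPy sketch below is an illustrative assumption, not the paper's implementation: it uses identity maps in place of learned Q/K/V projections, mean pooling as the coarse descriptor, and omits the multi-scale aggregation, EDM, and EUF components.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_sparse_attention(x, region=4, topk=2):
    """Coarse-to-fine sparse attention sketch (hypothetical, not MDSA itself).
    x: (H, W, C) feature map, H and W divisible by `region`.
    Each query region attends only to tokens in its `topk` most
    relevant regions, selected at a coarse (region) level."""
    H, W, C = x.shape
    rh, rw = H // region, W // region
    q = k = v = x  # identity projections stand in for learned Wq, Wk, Wv

    def to_regions(t):
        # (H, W, C) -> (num_regions, tokens_per_region, C)
        t = t.reshape(region, rh, region, rw, C)
        return t.transpose(0, 2, 1, 3, 4).reshape(region * region, rh * rw, C)

    qr, kr, vr = map(to_regions, (q, k, v))

    # Coarse level: one descriptor per region, routing by affinity.
    q_coarse = qr.mean(axis=1)                     # (R, C)
    k_coarse = kr.mean(axis=1)                     # (R, C)
    affinity = q_coarse @ k_coarse.T               # (R, R)
    idx = np.argsort(-affinity, axis=1)[:, :topk]  # top-k regions per query region

    # Fine level: full SA, but only over the gathered key-value pairs.
    out = np.empty_like(qr)
    scale = 1.0 / np.sqrt(C)
    for r in range(region * region):
        kv_k = kr[idx[r]].reshape(-1, C)           # (topk*rh*rw, C)
        kv_v = vr[idx[r]].reshape(-1, C)
        attn = softmax(qr[r] @ kv_k.T * scale, axis=-1)
        out[r] = attn @ kv_v

    # Regions back to (H, W, C).
    out = out.reshape(region, region, rh, rw, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, C)
```

The cost saving is the point of the design: each query attends to `topk * rh * rw` keys instead of all `H * W`, and because attention output is invariant to the ordering of gathered keys, setting `topk` to the total number of regions recovers dense self-attention exactly.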