Xu Guoan, Jia Wenjing, Wu Tao, Chen Ligeng, Gao Guangwei
IEEE Trans Image Process. 2024;33:4202-4214. doi: 10.1109/TIP.2024.3425048. Epub 2024 Jul 22.
Both Convolutional Neural Networks (CNNs) and Transformers have shown great success in semantic segmentation tasks. Efforts have been made to integrate CNNs with Transformer models to capture both local and global context interactions. However, there is still room for improvement, particularly under tight computational budgets. In this paper, we introduce HAFormer, a model that combines the hierarchical feature extraction ability of CNNs with the global dependency modeling capability of Transformers to tackle lightweight semantic segmentation challenges. Specifically, we design a Hierarchy-Aware Pixel-Excitation (HAPE) module for adaptive multi-scale local feature extraction. For global perception modeling, we devise an Efficient Transformer (ET) module that streamlines the quadratic computations associated with traditional Transformers. Moreover, a correlation-weighted Fusion (cwF) module selectively merges diverse feature representations, significantly enhancing predictive accuracy. HAFormer delivers high performance with minimal computational overhead and a compact model size, achieving 74.2% mIoU on the Cityscapes test set and 71.1% mIoU on the CamVid test set, at frame rates of 105 FPS and 118 FPS, respectively, on a single 2080Ti GPU. The source code is available at https://github.com/XU-GITHUB-curry/HAFormer.
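To make the two efficiency ideas in the abstract concrete, below is a minimal PyTorch sketch of (a) a linear-complexity attention block in the spirit of the ET module, and (b) a correlation-weighted fusion of CNN and Transformer features in the spirit of cwF. All class names, parameters, and design details here are illustrative assumptions, not the authors' implementation; the official code is at the GitHub link above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EfficientAttention(nn.Module):
    """Linear-complexity attention sketch: softmax is applied to Q over
    channels and to K over tokens, so (K^T V) can be computed first and the
    cost drops from O(N^2 * d) to O(N * d^2). This is one common way to
    streamline quadratic attention; the paper's ET module may differ."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (B, N, C)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):                              # (B, N, C) -> (B, h, N, C//h)
            return t.view(B, N, self.heads, C // self.heads).transpose(1, 2)

        q, k, v = map(split, (q, k, v))
        q = q.softmax(dim=-1)                      # normalize over channels
        k = k.softmax(dim=-2)                      # normalize over tokens
        ctx = k.transpose(-2, -1) @ v              # (B, h, d, d) global context
        out = (q @ ctx).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class CorrelationWeightedFusion(nn.Module):
    """Hypothetical cwF sketch: weight the CNN and Transformer branches by
    their per-channel cosine similarity before merging, so channels on which
    the two representations agree contribute more to the fused output."""

    def __init__(self, dim):
        super().__init__()
        self.merge = nn.Conv2d(dim * 2, dim, kernel_size=1)

    def forward(self, f_cnn, f_tr):                # both: (B, C, H, W)
        B, C, H, W = f_cnn.shape
        a = F.normalize(f_cnn.flatten(2), dim=-1)  # (B, C, H*W)
        b = F.normalize(f_tr.flatten(2), dim=-1)
        corr = (a * b).sum(-1).view(B, C, 1, 1)    # per-channel cosine similarity
        w = torch.sigmoid(corr)                    # correlation-derived weight
        fused = torch.cat([w * f_cnn, (1 - w) * f_tr], dim=1)
        return self.merge(fused)                   # (B, C, H, W)
```

The key design point illustrated here is that factoring attention as Q(KᵀV) replaces the N-by-N token affinity matrix with a small d-by-d context matrix, which is what makes such attention affordable at the high resolutions typical of Cityscapes inputs.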