Song Xiaofei, Chen Mingju, Rao Jie, Luo Yangming, Lin Zhihao, Zhang Xingyue, Li Senyuan, Hu Xiao
School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin 644005, China.
Intelligent Perception and Control Key Laboratory of Sichuan Province, Sichuan University of Science and Engineering, Yibin 644005, China.
Sensors (Basel). 2025 Jul 27;25(15):4660. doi: 10.3390/s25154660.
To improve semantic segmentation of complex urban remote sensing images, which suffer from multi-scale object distributions, inter-class similarity, and missed small objects, this paper proposes MFPI-Net, an encoder-decoder semantic segmentation network. It comprises four core modules: a Swin Transformer backbone encoder, a diverse dilation rates attention shuffle decoder (DDRASD), a multi-scale convolutional feature enhancement module (MCFEM), and a cross-path residual fusion module (CPRFM). The Swin Transformer efficiently extracts multi-level global semantic features through its hierarchical structure and window attention mechanism. Within the DDRASD, the diverse dilation rates attention (DDRA) block combines convolutions with diverse dilation rates and channel-coordinate attention to enhance multi-scale contextual awareness, while the Shuffle Block increases resolution via pixel rearrangement and avoids checkerboard artifacts. The MCFEM enhances local feature modeling through parallel multi-kernel convolutions, complementing the Swin Transformer's global perception capability. The CPRFM employs multi-branch convolutions and a residual multiplication-addition fusion mechanism to strengthen interactions among multi-source features, thereby improving the recognition of small objects and similar categories. Experiments on the ISPRS Vaihingen and Potsdam datasets show that MFPI-Net outperforms mainstream methods, achieving mIoU scores of 82.57% and 88.49%, respectively, validating its superior segmentation performance in urban remote sensing.
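The "pixel rearrangement" used by the Shuffle Block can be illustrated with a small sketch. The abstract does not specify the exact operation, so the code below assumes the standard sub-pixel (pixel shuffle) layout, in which a feature map with C·r² channels of size H×W is rearranged into C channels of size rH×rW. The function name `pixel_shuffle` and the nested-list tensor layout are illustrative choices, not taken from the paper.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Assumed layout (standard sub-pixel convolution ordering):
        out[c][h*r + i][w*r + j] = x[c*r*r + i*r + j][h][w]
    This is an illustration, not the paper's Shuffle Block implementation.
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    height, width = len(x[0]), len(x[0][0])
    channels_out = channels_in // (r * r)

    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(height * r):
            for j in range(width * r):
                # Each output pixel is copied from exactly one input channel,
                # selected by its position inside the r x r sub-pixel cell.
                src_channel = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src_channel][i // r][j // r]
    return out


# Four 1x2 channels become one 2x4 channel (upscale factor r = 2).
features = [[[0, 10]], [[1, 11]], [[2, 12]], [[3, 13]]]
upsampled = pixel_shuffle(features, 2)
print(upsampled)  # [[[0, 1, 10, 11], [2, 3, 12, 13]]]
```

Because every output pixel is copied from exactly one input location, no output position receives overlapping kernel contributions, which is the usual explanation for why this style of upsampling avoids the checkerboard artifacts associated with transposed convolutions.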