Wang Yan, Cao Li, Deng He
School of Electrical and Electronic Engineering, Wuhan Polytechnic University, Wuhan 430023, China.
School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430081, China.
Sensors (Basel). 2024 Nov 13;24(22):7266. doi: 10.3390/s24227266.
Semantic segmentation of remote sensing images is a fundamental task in computer vision, holding substantial relevance in applications such as land cover surveys, environmental protection, and urban building planning. In recent years, multi-modal fusion-based models have garnered considerable attention, exhibiting superior segmentation performance compared with traditional single-modal techniques. Nonetheless, the majority of these multi-modal models, which rely on Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) for feature fusion, face limitations in long-range modeling capability or computational complexity. This paper presents a novel Mamba-based multi-modal fusion network, MFMamba, for semantic segmentation of remote sensing images. Specifically, the network employs a dual-branch encoding structure, consisting of a CNN-based main encoder that extracts local features from high-resolution remote sensing images (HRRSIs) and a Mamba-based auxiliary encoder that captures global features from the corresponding digital surface model (DSM). To capitalize on the distinct attributes of the multi-modal remote sensing data in the two branches, a feature fusion block (FFB) is designed to synergistically enhance and integrate the features extracted from the dual-branch structure at each stage. Extensive experiments on the Vaihingen and Potsdam datasets have verified the effectiveness and superiority of MFMamba in semantic segmentation of remote sensing images. Compared with state-of-the-art methods, MFMamba achieves higher overall accuracy (OA), a higher mean F1 score (mF1), and a higher mean intersection over union (mIoU), while maintaining low computational complexity.
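To make the dual-branch design described above concrete, the following is a minimal PyTorch sketch of a CNN main branch for the HRRSI, an auxiliary branch for the DSM, and a per-stage feature fusion block. The channel sizes, the concat-plus-1x1-conv fusion rule, the way fused features are fed forward in the main branch, and the simple global token-mixing module standing in for the Mamba (state-space) block are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; fusion rule, channel sizes, and the Mamba stand-in
# are assumptions, not the MFMamba implementation from the paper.
import torch
import torch.nn as nn


class ConvStage(nn.Module):
    """One CNN encoder stage: two 3x3 convs followed by 2x downsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.down = nn.MaxPool2d(2)

    def forward(self, x):
        return self.down(self.body(x))


class AuxStage(nn.Module):
    """Auxiliary (DSM) stage; a token-mixing placeholder stands in for a Mamba block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, 1)
        self.mix = nn.Linear(out_ch, out_ch)   # placeholder for a Mamba (SSM) block
        self.down = nn.MaxPool2d(2)

    def forward(self, x):
        x = self.proj(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence
        tokens = tokens + self.mix(tokens)      # global mixing (Mamba stand-in)
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.down(x)


class FFB(nn.Module):
    """Feature fusion block: merge HRRSI and DSM features at one stage."""
    def __init__(self, ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True)
        )

    def forward(self, f_img, f_dsm):
        return self.fuse(torch.cat([f_img, f_dsm], dim=1))


class DualBranchEncoder(nn.Module):
    """CNN main branch (RGB image) + auxiliary branch (DSM), fused at each stage."""
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        in_img, in_dsm = 3, 1
        self.img_stages, self.dsm_stages, self.ffbs = nn.ModuleList(), nn.ModuleList(), nn.ModuleList()
        for ch in channels:
            self.img_stages.append(ConvStage(in_img, ch))
            self.dsm_stages.append(AuxStage(in_dsm, ch))
            self.ffbs.append(FFB(ch))
            in_img = in_dsm = ch

    def forward(self, img, dsm):
        fused_features = []
        for img_stage, dsm_stage, ffb in zip(self.img_stages, self.dsm_stages, self.ffbs):
            img = img_stage(img)
            dsm = dsm_stage(dsm)
            fused = ffb(img, dsm)
            fused_features.append(fused)   # multi-scale features for a segmentation decoder
            img = fused                    # feed fused features forward in the main branch (assumption)
        return fused_features


if __name__ == "__main__":
    enc = DualBranchEncoder()
    feats = enc(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))
    print([f.shape for f in feats])
```

In this sketch, the per-stage fused features would be passed to a decoder that upsamples and classifies them into land-cover categories; the decoder and the exact Mamba block are omitted here.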