Chang Junhao, Cen Yuefeng, Cen Gang
School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China.
Sensors (Basel). 2024 Sep 25;24(19):6198. doi: 10.3390/s24196198.
The accurate extraction of buildings from remote sensing images is crucial in fields such as 3D urban planning, disaster detection, and military reconnaissance. In recent years, Transformer-based models have performed well in global information processing and contextual relationship modeling, but they suffer from high computational costs and a limited ability to capture local information. Conversely, convolutional neural networks (CNNs) are very effective at extracting local features but have limited capacity for processing global information. This paper proposes an asymmetric network (CTANet) that combines the strengths of CNNs and Transformers to extract buildings efficiently. Specifically, CTANet employs ConvNeXt as the encoder to extract features and pairs it with an efficient bilateral hybrid attention transformer (BHAFormer) designed as the decoder. The BHAFormer establishes global dependencies from the perspectives of both texture edge features and background information, extracting buildings more accurately while keeping the computational cost low. Additionally, a multiscale mixed attention mechanism module (MSM-AMM) is introduced to learn multiscale semantic information and channel representations of the encoder features, reducing noise interference and compensating for information lost during downsampling. Experimental results show that the proposed model achieves the best F1-score (86.7%, 95.74%, and 90.52%) and IoU (76.52%, 91.84%, and 82.68%) among state-of-the-art methods on the Massachusetts building dataset, the WHU building dataset, and the Inria aerial image labeling dataset.
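The CNN/Transformer trade-off described in the abstract can be illustrated with a minimal toy sketch. This is plain NumPy and is not the paper's actual CTANet, BHAFormer, or MSM-AMM modules (whose details are not given here); it only contrasts a convolution's local receptive field with self-attention's global dependencies on a 1-D signal.

```python
import numpy as np

def local_conv(x, kernel):
    # CNN-style local feature extraction: each output depends only on a
    # small neighborhood of the input (limited receptive field).
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + k], kernel) for i in range(len(x))])

def self_attention(x):
    # Transformer-style global modeling: every position attends to all
    # others, so each output mixes information from the whole sequence.
    scores = np.outer(x, x)  # toy 1-dim "query/key" similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ x

x = np.array([0.0, 1.0, 0.0, 0.0, 3.0])
print(local_conv(x, np.ones(3)))  # each output sums a 3-wide window
print(self_attention(x))          # each output mixes the whole signal
```

In the toy example, the convolution output at position 0 is unaffected by the distant spike at position 4, while every self-attention output is; a hybrid encoder-decoder design like the one the abstract describes aims to get both behaviors at a manageable computational cost.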