Transformer-Based Model with Dynamic Attention Pyramid Head for Semantic Segmentation of VHR Remote Sensing Imagery

Authors

Xu Yufen, Zhou Shangbo, Huang Yuhui

Affiliation

College of Computer Science, Chongqing University, Chongqing 400044, China.

Publication information

Entropy (Basel). 2022 Nov 6;24(11):1619. doi: 10.3390/e24111619.

DOI:10.3390/e24111619

PMID:36359709

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9689728/

Abstract

Convolutional neural networks have long dominated semantic segmentation of very-high-resolution (VHR) remote sensing (RS) images. However, restricted by the fixed receptive field of the convolution operation, convolution-based models cannot directly obtain contextual information. Meanwhile, Swin Transformer possesses great potential in modeling long-range dependencies. Nevertheless, Swin Transformer breaks images into patches that are treated as one-dimensional sequences, without considering the loss of position information inside patches. Therefore, inspired by Swin Transformer and Unet, we propose SUD-Net (Swin transformer-based Unet-like with Dynamic attention pyramid head Network), a new U-shaped architecture that combines Swin Transformer blocks and convolution layers through a dual encoder and an upsampling decoder, with a Dynamic Attention Pyramid Head (DAPH) attached to the backbone. First, we propose a dual encoder structure combining Swin Transformer blocks and reslayers in reverse order to complement global semantics with detailed representations. Second, to address the spatial loss problem inside each patch, we design a Multi-Path Fusion Model (MPFM) with a specially devised Patch Attention (PA) to encode position information of patches and adaptively fuse features of different scales through attention mechanisms. Third, a Dynamic Attention Pyramid Head is constructed with deformable convolution to dynamically aggregate effective and important semantic information. On the ISPRS Potsdam dataset, SUD-Net achieves 92.51% mF1, 86.4% mIoU, and 92.98% OA; on the ISPRS Vaihingen dataset, it achieves 89.49% mF1, 81.26% mIoU, and 90.95% OA.
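
The three reported metrics (OA, mF1, mIoU) can all be derived from a per-class confusion matrix. The following is a minimal sketch of the standard definitions, not the authors' evaluation code: the function name `segmentation_metrics` and the toy matrix are hypothetical, and it assumes every class is averaged with equal weight. The abstract does not state which classes are included or how boundaries are handled, so the paper's exact protocol may differ.

```python
def segmentation_metrics(confusion):
    """Compute OA, mean F1, and mean IoU from a square confusion matrix.

    confusion[i][j] is the number of pixels whose true class is i
    and whose predicted class is j.
    Assumption: all classes are averaged with equal weight.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n))

    f1_scores, ious = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # true c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, true otherwise
        f1_den = 2 * tp + fp + fn
        iou_den = tp + fp + fn
        # Classes absent from both truth and prediction score 0 here.
        f1_scores.append(2 * tp / f1_den if f1_den else 0.0)
        ious.append(tp / iou_den if iou_den else 0.0)

    return {
        "OA": correct / total if total else 0.0,  # overall pixel accuracy
        "mF1": sum(f1_scores) / n,                # mean per-class F1
        "mIoU": sum(ious) / n,                    # mean per-class IoU
    }


# Hypothetical two-class example, purely for illustration.
toy = [[8, 2],
       [1, 9]]
metrics = segmentation_metrics(toy)
print(metrics)
```

Because mIoU penalizes false positives and false negatives more heavily than F1 does, mIoU is never higher than mF1 for the same predictions, which matches the ordering of the figures reported above.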

