
CFANet: The Cross-Modal Fusion Attention Network for Indoor RGB-D Semantic Segmentation

Authors

Wu Long-Fei, Wei Dan, Xu Chang-An

Affiliations

School of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, Shanghai 201620, China.

College of Materials and Energy, South China Agricultural University, Guangzhou 510642, China.

Publication

J Imaging. 2025 May 27;11(6):177. doi: 10.3390/jimaging11060177.

DOI: 10.3390/jimaging11060177
PMID: 40558775
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12194209/
Abstract

Indoor image semantic segmentation is applied in fields such as smart homes and indoor security. Semantic segmentation techniques that use RGB images and depth maps as data sources face two challenges: the semantic gap between the two modalities and the loss of detailed information. To address these issues, a multi-head self-attention mechanism adaptively aligns the features of the two modalities and fuses them in both the spatial and channel dimensions. Feature extraction methods are tailored to the different characteristics of RGB images and depth maps. For RGB images, asymmetric convolution captures features in the horizontal and vertical directions, strengthens short-range information dependence, and mitigates the gridding effect of dilated convolution, while criss-cross attention gathers contextual information from global dependency relationships. For depth maps, salient unimodal features are extracted along the channel and spatial dimensions. A lightweight skip-connection module fuses low-level and high-level features. In addition, since the first layer contains the richest detailed information and the last layer contains rich semantic information, a feature refinement head fuses the two. The method achieves mIoU scores of 53.86% and 51.85% on the NYUDv2 and SUN-RGBD datasets, respectively, outperforming mainstream methods.
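The cross-modal fusion step described in the abstract — attention that aligns RGB and depth features before fusing them — can be illustrated with a minimal single-head sketch. This is not the authors' implementation (which uses multi-head self-attention inside a full network); the function name, token shapes, and residual fusion here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(rgb, depth):
    """Fuse depth features into RGB features via scaled dot-product
    attention: each RGB token queries the depth tokens (hypothetical sketch)."""
    d = rgb.shape[-1]
    scores = rgb @ depth.T / np.sqrt(d)   # (N_rgb, N_depth) similarities
    weights = softmax(scores, axis=-1)    # attention over depth tokens
    attended = weights @ depth            # depth context per RGB token
    return rgb + attended                 # residual fusion of the two modalities

# Toy example: 6 RGB tokens and 6 depth tokens, 8-dim features each
rng = np.random.default_rng(0)
rgb = rng.standard_normal((6, 8))
depth = rng.standard_normal((6, 8))
fused = cross_modal_attention(rgb, depth)
print(fused.shape)  # (6, 8)
```

In the paper this alignment is done per spatial position and per channel with multiple heads; the sketch only shows the core query-key-value mechanism that bridges the two modalities.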

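The scores quoted for NYUDv2 and SUN-RGBD (53.86% and 51.85%) are mean Intersection-over-Union values. A minimal sketch of how this metric is typically computed for label maps — not the benchmark's evaluation code — looks like this:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union across classes, the standard
    semantic-segmentation metric (illustrative sketch)."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x3 label maps with 2 classes
pred   = np.array([[0, 0, 1], [1, 1, 0]])
target = np.array([[0, 1, 1], [1, 1, 0]])
print(mean_iou(pred, target, 2))  # ≈ 0.708
```

Benchmarks usually accumulate the per-class intersections and unions over the whole test set before averaging, rather than per image as shown here.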

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/641a/12194209/8b0399af6864/jimaging-11-00177-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/641a/12194209/1158a1a9fe0a/jimaging-11-00177-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/641a/12194209/8bcf9a7e7ead/jimaging-11-00177-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/641a/12194209/096f2f105481/jimaging-11-00177-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/641a/12194209/4f9d0c923f30/jimaging-11-00177-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/641a/12194209/023b58694403/jimaging-11-00177-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/641a/12194209/c32c7b4bcdd6/jimaging-11-00177-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/641a/12194209/0fdf9976339d/jimaging-11-00177-g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/641a/12194209/b732084c260e/jimaging-11-00177-g010.jpg

Similar Articles

1. DGCFNet: Dual Global Context Fusion Network for remote sensing image semantic segmentation.
   PeerJ Comput Sci. 2025 Mar 27;11:e2786. doi: 10.7717/peerj-cs.2786. eCollection 2025.
2. Liver Semantic Segmentation Method Based on Multi-Channel Feature Extraction and Cross Fusion.
   Bioengineering (Basel). 2025 Jun 11;12(6):636. doi: 10.3390/bioengineering12060636.
3. MACCoM: A multiple attention and convolutional cross-mixer framework for detailed 2D biomedical image segmentation.
   Comput Biol Med. 2024 Sep;179:108847. doi: 10.1016/j.compbiomed.2024.108847. Epub 2024 Jul 15.
4. TLTNet: A novel transscale cascade layered transformer network for enhanced retinal blood vessel segmentation.
   Comput Biol Med. 2024 Aug;178:108773. doi: 10.1016/j.compbiomed.2024.108773. Epub 2024 Jun 25.
5. Prediction of Alzheimer's Disease Based on Multi-Modal Domain Adaptation.
   Brain Sci. 2025 Jun 7;15(6):618. doi: 10.3390/brainsci15060618.
6. Lightweight 2D Medical Image Segmentation via a Decoder Using Linear Deformable Convolution and Multi-scale Self-attention.
   IEEE J Biomed Health Inform. 2025 Jun 25;PP. doi: 10.1109/JBHI.2025.3583108.
7. CDFAN: Cross-Domain Fusion Attention Network for Pansharpening.
   Entropy (Basel). 2025 May 27;27(6):567. doi: 10.3390/e27060567.
8. GaitCSF: Multi-Modal Gait Recognition Network Based on Channel Shuffle Regulation and Spatial-Frequency Joint Learning.
   Sensors (Basel). 2025 Jun 16;25(12):3759. doi: 10.3390/s25123759.
9. Multi-class segmentation of knee MRI based on hybrid attention.
   Front Med (Lausanne). 2025 Jun 11;12:1581487. doi: 10.3389/fmed.2025.1581487. eCollection 2025.

References Cited in This Article

1. AESeg: Affinity-enhanced segmenter using feature class mapping knowledge distillation for efficient RGB-D semantic segmentation of indoor scenes.
   Neural Netw. 2025 Aug;188:107438. doi: 10.1016/j.neunet.2025.107438. Epub 2025 Mar 25.
2. Enhanced CATBraTS for Brain Tumour Semantic Segmentation.
   J Imaging. 2025 Jan 3;11(1):8. doi: 10.3390/jimaging11010008.
3. Efficient sub-pixel convolutional neural network for terahertz image super-resolution.
   Opt Lett. 2022 Jun 15;47(12):3115-3118. doi: 10.1364/OL.454267.
4. SA-Net: A scale-attention network for medical image segmentation.
   PLoS One. 2021 Apr 14;16(4):e0247388. doi: 10.1371/journal.pone.0247388. eCollection 2021.
5. Spatial Information Guided Convolution for Real-Time RGBD Semantic Segmentation.
   IEEE Trans Image Process. 2021;30:2313-2324. doi: 10.1109/TIP.2021.3049332. Epub 2021 Jan 27.
6. Semantic Segmentation with Context Encoding and Multi-Path Decoding.
   IEEE Trans Image Process. 2020 Jan 9. doi: 10.1109/TIP.2019.2962685.
7. SCN: Switchable Context Network for Semantic Segmentation of RGB-D Images.
   IEEE Trans Cybern. 2020 Mar;50(3):1120-1131. doi: 10.1109/TCYB.2018.2885062. Epub 2018 Dec 20.
8. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs.
   IEEE Trans Pattern Anal Mach Intell. 2018 Apr;40(4):834-848. doi: 10.1109/TPAMI.2017.2699184. Epub 2017 Apr 27.
9. Fast Feature Pyramids for Object Detection.
   IEEE Trans Pattern Anal Mach Intell. 2014 Aug;36(8):1532-45. doi: 10.1109/TPAMI.2014.2300479.