Chen Hao, Deng Yongjian, Li Youfu, Hung Tzu-Yi, Lin Guosheng
IEEE Trans Image Process. 2020 Aug 12;PP. doi: 10.1109/TIP.2020.3014734.
Depth is beneficial for salient object detection (SOD) because it provides additional saliency cues. Existing RGB-D SOD methods focus on tailoring complicated cross-modal fusion topologies which, although they achieve encouraging performance, carry a high risk of over-fitting and remain ambiguous about which cross-modal complements they actually exploit. Unlike these conventional approaches, which combine cross-modal features wholesale without differentiation, we concentrate on decoupling the diverse cross-modal complements to simplify the fusion process and make the fusion more thorough. We argue that if cross-modal heterogeneous representations can be disentangled explicitly, the cross-modal fusion process involves less uncertainty while enjoying better adaptability. To this end, we design a disentangled cross-modal fusion network that exposes structural and content representations from both modalities via cross-modal reconstruction. For different scenes, the disentangled representations allow the fusion module to easily identify and incorporate the desired complements for informative multi-modal fusion. Extensive experiments demonstrate the effectiveness of our designs and a large margin of improvement over state-of-the-art methods.
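The disentangle-then-fuse idea described above can be sketched in a few lines. The sketch below is a simplified illustration under assumed names and dimensions, not the paper's actual architecture (which uses deep CNN backbones and learned encoders): each modality's features are split into a structural and a content part, cross-modal reconstruction pairs one modality's structure with the other's content, and fusion then operates on the four disentangled parts rather than on entangled features.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # Illustrative random projection standing in for a learned layer.
    W = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: x @ W

D = 64  # feature dimension (assumed for illustration)

# Each modality is encoded into a structural and a content representation.
enc_struct_rgb, enc_content_rgb = linear(D, D), linear(D, D)
enc_struct_dep, enc_content_dep = linear(D, D), linear(D, D)
dec = linear(2 * D, D)  # shared decoder used for cross-modal reconstruction

rgb_feat = rng.standard_normal((1, D))  # stand-in for RGB backbone features
dep_feat = rng.standard_normal((1, D))  # stand-in for depth backbone features

s_rgb, c_rgb = enc_struct_rgb(rgb_feat), enc_content_rgb(rgb_feat)
s_dep, c_dep = enc_struct_dep(dep_feat), enc_content_dep(dep_feat)

# Cross-modal reconstruction: pair one modality's structure with the other
# modality's content; minimizing this loss during training would push the
# encoders to disentangle the two factors.
rec_rgb = dec(np.concatenate([s_dep, c_rgb], axis=1))
rec_dep = dec(np.concatenate([s_rgb, c_dep], axis=1))
loss_rec = np.mean((rec_rgb - rgb_feat) ** 2) + np.mean((rec_dep - dep_feat) ** 2)

# Fusion can now select among the four disentangled parts per scene instead
# of merging entangled cross-modal features wholesale.
fused = np.concatenate([s_rgb, c_rgb, s_dep, c_dep], axis=1)
```

In the real network the projections are trained jointly with the reconstruction and saliency objectives; the point of the sketch is only the factorization and the structure-content swap.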