Chen Hao, Deng Yongjian, Li Youfu, Hung Tzu-Yi, Lin Guosheng
IEEE Trans Image Process. 2020 Aug 12;PP. doi: 10.1109/TIP.2020.3014734.
Depth is beneficial for salient object detection (SOD) because it provides additional saliency cues. Existing RGB-D SOD methods focus on tailoring complicated cross-modal fusion topologies which, although they achieve encouraging performance, carry a high risk of over-fitting and remain ambiguous about which cross-modal complements they actually exploit. Unlike these conventional approaches, which combine cross-modal features wholesale without differentiation, we concentrate on decoupling the diverse cross-modal complements to simplify the fusion process and make the fusion more thorough. We argue that if cross-modal heterogeneous representations can be disentangled explicitly, the cross-modal fusion process involves less uncertainty while enjoying better adaptability. To this end, we design a disentangled cross-modal fusion network that exposes structural and content representations from both modalities via cross-modal reconstruction. For different scenes, the disentangled representations allow the fusion module to easily identify and incorporate the desired complements for informative multi-modal fusion. Extensive experiments demonstrate the effectiveness of our designs and a large margin of improvement over state-of-the-art methods.
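The disentangle-then-fuse idea described above can be sketched in a few lines. The sketch below is a simplified illustration under assumed names and dimensions, not the paper's actual architecture (which uses deep CNN backbones and learned encoders): each modality's features are split into a structural and a content part, cross-modal reconstruction pairs one modality's structure with the other's content, and fusion then operates on the four disentangled parts rather than on entangled features.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # Illustrative random projection standing in for a learned layer.
    W = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: x @ W

D = 64  # feature dimension (assumed for illustration)

# Each modality is encoded into a structural and a content representation.
enc_struct_rgb, enc_content_rgb = linear(D, D), linear(D, D)
enc_struct_dep, enc_content_dep = linear(D, D), linear(D, D)
dec = linear(2 * D, D)  # shared decoder used for cross-modal reconstruction

rgb_feat = rng.standard_normal((1, D))  # stand-in for RGB backbone features
dep_feat = rng.standard_normal((1, D))  # stand-in for depth backbone features

s_rgb, c_rgb = enc_struct_rgb(rgb_feat), enc_content_rgb(rgb_feat)
s_dep, c_dep = enc_struct_dep(dep_feat), enc_content_dep(dep_feat)

# Cross-modal reconstruction: pair one modality's structure with the other
# modality's content; minimizing this loss during training would push the
# encoders to disentangle the two factors.
rec_rgb = dec(np.concatenate([s_dep, c_rgb], axis=1))
rec_dep = dec(np.concatenate([s_rgb, c_dep], axis=1))
loss_rec = np.mean((rec_rgb - rgb_feat) ** 2) + np.mean((rec_dep - dep_feat) ** 2)

# Fusion can now select among the four disentangled parts per scene instead
# of merging entangled cross-modal features wholesale.
fused = np.concatenate([s_rgb, c_rgb, s_dep, c_dep], axis=1)
```

In the real network the projections are trained jointly with the reconstruction and saliency objectives; the point of the sketch is only the factorization and the structure-content swap.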