Song Mengke, Song Wenfeng, Yang Guowei, Chen Chenglizhao
IEEE Trans Image Process. 2022;31:6124-6138. doi: 10.1109/TIP.2022.3205747. Epub 2022 Sep 22.
Most existing RGB-D salient object detection (SOD) methods focus primarily on cross-modal and cross-level saliency fusion, which has proven efficient and effective. However, these methods share a critical limitation: their fusion patterns, typically combinations of selective characteristics and their variants, depend too heavily on the network's non-linear adaptability. In such methods, the balance between RGB and D (depth) is formulated individually over intermediate feature slices, so the relation at the modality level may not be learned properly. The optimal RGB-D combination differs from scene to scene, and the exact complementary status is frequently determined by multiple modality-level factors, such as depth quality, the complexity of the RGB scene, and the degree of harmony between them. Existing approaches are therefore unlikely to achieve further performance breakthroughs, because their methodologies remain largely insensitive to these modality-level factors. To overcome this problem, this paper presents the Modality-aware Decoder (MaD). Its key technical innovations include a series of feature embedding, modality reasoning, and feature back-projecting and collecting strategies, all of which upgrade the widely used multi-scale and multi-level decoding process to be modality-aware. Our MaD achieves competitive performance against other state-of-the-art (SOTA) models without using any fancy tricks in the decoder's design. Code and results will be publicly available at https://github.com/MengkeSong/MaD.
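To make the modality-level idea concrete, the sketch below contrasts slice-wise weighting with a single scene-level RGB/D balance derived from globally pooled statistics. This is a toy illustration under our own assumptions, not the paper's actual MaD decoder; the function name `modality_gate` and the softmax-over-pooled-means formulation are hypothetical choices for exposition.

```python
import numpy as np

def modality_gate(rgb_feat: np.ndarray, depth_feat: np.ndarray) -> np.ndarray:
    """Toy modality-level fusion (illustrative only; not the paper's MaD).

    Rather than learning an independent weight per feature slice, derive one
    scene-level weight per modality from globally pooled descriptors, so the
    RGB/D balance reflects modality-level factors (e.g. overall depth quality)
    instead of per-channel statistics alone.
    """
    # Global average pooling: one scalar descriptor per modality.
    g_rgb = rgb_feat.mean()
    g_d = depth_feat.mean()
    # Softmax over the two descriptors yields modality-level weights.
    e = np.exp([g_rgb, g_d])
    w_rgb, w_d = e / e.sum()
    # Fuse the two feature maps with a single scene-level balance.
    return w_rgb * rgb_feat + w_d * depth_feat
```

For example, feeding an all-ones RGB map and an all-zeros depth map makes the gate down-weight the (uninformative) depth branch for the whole scene at once, which is the behaviour a slice-wise scheme cannot guarantee globally.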