Chen Hao, Shen Feihong, Ding Ding, Deng Yongjian, Li Chao
IEEE Trans Image Process. 2024;33:1699-1709. doi: 10.1109/TIP.2024.3364022. Epub 2024 Mar 5.
Previous multi-modal transformers for RGB-D salient object detection (SOD) generally connect all patches from the two modalities directly to model cross-modal correlation, and combine the multi-modal features without differentiation, which can lead to confusing and inefficient fusion. Instead, we disentangle the cross-modal complementarity from two views to reduce cross-modal fusion ambiguity. 1) Context disentanglement. We argue that modeling long-range dependencies across modalities as done before is uninformative due to the severe modality gap. In contrast, we propose to disentangle the cross-modal complementary contexts into intra-modal self-attention, which explores global complementary understanding, and spatially aligned inter-modal attention, which captures local cross-modal correlations. 2) Representation disentanglement. Unlike the previous undifferentiated combination of cross-modal representations, we find that cross-modal cues complement each other by enhancing common discriminative regions and by mutually supplementing modality-specific highlights. On top of this, we divide the tokens into consistent and private ones along the channel dimension to disentangle the multi-modal integration path and explicitly boost the two complementary ways. By progressively propagating this strategy across layers, the proposed Disentangled Feature Pyramid module (DFP) enables informative cross-modal cross-level integration and better fusion adaptivity. Comprehensive experiments on a large variety of public datasets verify the efficacy of our context and representation disentanglement and show consistent improvements over state-of-the-art models. Additionally, our cross-modal attention hierarchy is plug-and-play for different backbone architectures (both transformer- and CNN-based) and downstream tasks, and experiments on a CNN-based model and on RGB-D semantic segmentation verify this generalization ability.
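The two disentanglement ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's exact formulation: the function names, the sigmoid gate used for spatially aligned inter-modal attention, the identity projections (no learned weights), and the multiplicative/additive split of consistent vs. private channels are all simplifying assumptions chosen only to show the structure of the fusion path.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def intra_modal_self_attention(x):
    # global self-attention WITHIN one modality (single head,
    # identity Q/K/V projections for brevity); x: (N_tokens, C)
    d = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d))
    return attn @ x

def aligned_inter_modal_attention(x, y):
    # spatially aligned cross-modal attention: each token of x attends
    # ONLY to the token of y at the same spatial location, via a
    # sigmoid gate on their similarity (an illustrative assumption)
    d = x.shape[-1]
    sim = (x * y).sum(axis=-1, keepdims=True) / np.sqrt(d)  # (N, 1)
    gate = 1.0 / (1.0 + np.exp(-sim))
    return x + gate * y

def disentangled_fusion(x_rgb, x_dep, c_consistent):
    # 1) context disentanglement: global intra-modal context,
    #    then local spatially aligned inter-modal exchange
    x_rgb = intra_modal_self_attention(x_rgb)
    x_dep = intra_modal_self_attention(x_dep)
    r = aligned_inter_modal_attention(x_rgb, x_dep)
    d = aligned_inter_modal_attention(x_dep, x_rgb)
    # 2) representation disentanglement: split channels into
    #    consistent tokens (enhance common discriminative regions,
    #    multiplicative here) and private tokens (supplement
    #    modality-specific highlights, additive here)
    cons = r[:, :c_consistent] * d[:, :c_consistent]
    priv = r[:, c_consistent:] + d[:, c_consistent:]
    return np.concatenate([cons, priv], axis=-1)

# toy usage: 16 tokens with 8 channels per modality
rng = np.random.default_rng(0)
fused = disentangled_fusion(rng.standard_normal((16, 8)),
                            rng.standard_normal((16, 8)),
                            c_consistent=4)
```

Note that the aligned inter-modal step costs O(N) rather than the O(N^2) of attending across all patches of both modalities, which reflects the efficiency argument the abstract makes against undifferentiated cross-modal attention.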