Lyu Pengfei, Yu Xiaosheng, Chi Jianning, Wu Hao, Wu Chengdong, Rajapakse Jagath C
IEEE Trans Image Process. 2025;34:2796-2810. doi: 10.1109/TIP.2025.3564821. Epub 2025 May 12.
Exploring complementary information between RGB and thermal/depth modalities is crucial for bi-modal salient object detection (BSOD). However, the distinct characteristics of the modalities often lead to large differences in their information distributions, which existing models based on convolutional operations or plug-and-play attention mechanisms struggle to handle. To overcome this challenge, we rethink the relationship between information complementarity and long-range relevance, and propose a uniform broad-view Twins Transformer Network (TwinsTNet) for accurate BSOD. Specifically, to efficiently fuse bi-modal information, we first design the Cross-Modal Federated Attention (CMFA), which mines complementary cues across modalities through element-wise global dependency. Second, to ensure accurate modality fusion, we propose the Semantic Consistency Attention Loss, which supervises the co-attention feature in CMFA using an attention map generated from the ground truth. Additionally, existing BSOD models do not explore inter-layer interactions; to address this, we propose the Cross-Scale Retracing Attention (CSRA), which retrieves query-relevant information from the stacked features of all previous layers, enabling flexible cross-layer interactions. The cooperation between CMFA and CSRA mitigates inductive bias in both the modality and layer dimensions, enhancing TwinsTNet's representational capability. Extensive experiments demonstrate that TwinsTNet outperforms twenty-two state-of-the-art models on ten BSOD benchmark datasets. The code is available at: https://github.com/JoshuaLPF/TwinsTNet.
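The cross-modal fusion idea behind CMFA can be illustrated with a minimal co-attention sketch: each modality's tokens query the other modality's tokens, so complementary cues flow in both directions. This is a hedged illustration only; the function name, token shapes, and residual fusion are assumptions for exposition, not the paper's actual implementation (which additionally supervises the co-attention map with the Semantic Consistency Attention Loss).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_coattention(rgb, aux):
    """Illustrative bidirectional co-attention between two modality streams.

    rgb, aux: (N, d) token matrices from the RGB and thermal/depth branches.
    Returns both enhanced streams plus the RGB->aux attention map (the kind
    of co-attention feature a ground-truth-derived map could supervise).
    """
    d = rgb.shape[1]
    scale = 1.0 / np.sqrt(d)
    # RGB tokens attend over auxiliary-modality tokens ...
    attn_r2a = softmax(rgb @ aux.T * scale, axis=-1)   # (N, N)
    # ... and auxiliary tokens attend over RGB tokens.
    attn_a2r = softmax(aux @ rgb.T * scale, axis=-1)   # (N, N)
    rgb_enh = rgb + attn_r2a @ aux   # RGB enriched with thermal/depth cues
    aux_enh = aux + attn_a2r @ rgb   # auxiliary stream enriched with RGB cues
    return rgb_enh, aux_enh, attn_r2a
```

Because every token attends globally over the other modality, the fusion is not limited by the local receptive field of convolution, which is the property the abstract contrasts against convolutional fusion.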
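Similarly, the cross-layer retrieval in CSRA can be sketched as attention from the current layer's tokens over a memory built by stacking all previous layers' features. Again a hypothetical sketch under assumed shapes, not the published architecture.

```python
import numpy as np

def cross_scale_retracing(query_feat, prev_layers):
    """Illustrative cross-layer retrieval: current-layer tokens retrieve
    query-relevant information from the concatenated features of all
    previous layers.

    query_feat:  (Nq, d) tokens of the current layer.
    prev_layers: list of (Ni, d) token matrices from earlier layers
                 (assumed already projected to a common width d).
    """
    memory = np.concatenate(prev_layers, axis=0)       # (sum_i Ni, d)
    d = query_feat.shape[1]
    logits = query_feat @ memory.T / np.sqrt(d)        # (Nq, sum_i Ni)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn = e / e.sum(axis=-1, keepdims=True)           # row-stochastic
    # Residual update: each query token absorbs relevant cross-layer cues.
    return query_feat + attn @ memory
```

Because the memory spans every earlier layer rather than only the adjacent one, interactions are not restricted to a fixed layer-to-layer wiring, matching the "flexible cross-layer interactions" the abstract describes.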