Bao Liuxin, Zhou Xiaofei, Lu Xiankai, Sun Yaoqi, Yin Haibing, Hu Zhenghui, Zhang Jiyong, Yan Chenggang
IEEE Trans Image Process. 2024;33:3212-3226. doi: 10.1109/TIP.2024.3393365. Epub 2024 May 6.
Depth images and thermal images contain spatial geometry information and surface temperature information, respectively, which can serve as complementary cues for the RGB modality. However, the quality of depth and thermal images is often unreliable in challenging scenarios, which degrades the performance of two-modality salient object detection (SOD). Meanwhile, some researchers have turned to the triple-modal SOD task, namely visible-depth-thermal (VDT) SOD, attempting to exploit the complementarity of the RGB, depth, and thermal images. However, existing triple-modal SOD methods fail to perceive the quality of the depth maps and thermal images, which leads to performance degradation in scenes with low-quality depth and thermal inputs. Therefore, in this paper, we propose a quality-aware selective fusion network (QSF-Net) for VDT salient object detection, which consists of three subnets: the initial feature extraction subnet, the quality-aware region selection subnet, and the region-guided selective fusion subnet. First, besides extracting features, the initial feature extraction subnet generates a preliminary prediction map for each modality via a shrinkage pyramid architecture equipped with the multi-scale fusion (MSF) module. Then, we design a weakly-supervised quality-aware region selection subnet to generate the quality-aware maps. Concretely, we first locate the high-quality and low-quality regions using the preliminary predictions; these regions constitute the pseudo labels used to train this subnet. Finally, the region-guided selective fusion subnet purifies the initial features under the guidance of the quality-aware maps, and then fuses the triple-modal features and refines the edge details of the prediction maps through the intra-modality and inter-modality attention (IIA) module and the edge refinement (ER) module, respectively.
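The two central ideas above — deriving region-level quality pseudo labels from the preliminary predictions, and using quality-aware maps to gate the per-modality features before fusion — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the agreement-with-consensus criterion, the thresholds, and the weighted-sum fusion are all simplifying assumptions standing in for the learned subnets (the actual quality-aware region selection subnet and the IIA/ER modules are trained networks).

```python
import numpy as np

def pseudo_quality_labels(pred_rgb, pred_d, pred_t,
                          agree_thr=0.3, disagree_thr=0.7):
    """Sketch of quality pseudo-label mining from the three preliminary
    prediction maps (values in [0, 1]).  Pixels where a modality's
    prediction agrees with the cross-modal consensus are marked
    high-quality (1); pixels that strongly disagree are low-quality (0);
    the rest stay ambiguous (-1) and would be ignored during training.
    The consensus/threshold rule is an illustrative assumption."""
    consensus = (pred_rgb + pred_d + pred_t) / 3.0
    labels = {}
    for name, pred in (("rgb", pred_rgb), ("d", pred_d), ("t", pred_t)):
        err = np.abs(pred - consensus)
        label = np.full_like(pred, -1.0)
        label[err < agree_thr] = 1.0     # high-quality region
        label[err > disagree_thr] = 0.0  # low-quality region
        labels[name] = label
    return labels

def selective_fuse(feat_rgb, feat_d, feat_t, q_rgb, q_d, q_t, eps=1e-6):
    """Purify each modality's (C, H, W) features with its (H, W)
    quality-aware map and fuse them as a quality-weighted sum."""
    w = np.stack([q_rgb, q_d, q_t])               # (3, H, W)
    w = w / (w.sum(axis=0, keepdims=True) + eps)  # normalize per pixel
    feats = np.stack([feat_rgb, feat_d, feat_t])  # (3, C, H, W)
    return (w[:, None] * feats).sum(axis=0)       # (C, H, W)
```

With this weighting, a modality whose quality map is near zero at a pixel contributes almost nothing to the fused feature there, which is the intuition behind quality-guided selective fusion.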
Extensive experiments are performed on the VDT-2048 dataset, and the results show that our saliency model consistently outperforms 13 state-of-the-art methods by a large margin. Our code and results are available at https://github.com/Lx-Bao/QSFNet.