Department of Computing, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region of China.
Neural Netw. 2024 Feb;170:521-534. doi: 10.1016/j.neunet.2023.11.051. Epub 2023 Nov 24.
Image Salient Object Detection (SOD) is a fundamental research topic in computer vision. Recently, multimodal information from the RGB, Depth (D), and Thermal (T) modalities has been proven beneficial to SOD. However, existing methods are designed only for RGB-D or RGB-T SOD, which limits their applicability across modalities, or are merely fine-tuned on specific datasets, which incurs extra computational overhead. These defects hinder the practical deployment of SOD in real-world applications. In this paper, we propose an end-to-end Unified Triplet Decoder Network, dubbed UTDNet, for both RGB-T and RGB-D SOD tasks. The challenges of unified multimodal SOD are mainly two-fold: (1) accurately detecting and segmenting salient objects, and (2) doing so, preferably, with a single network that fits both RGB-T and RGB-D SOD. First, to address the former challenge, we propose a multi-scale feature extraction unit to enrich discriminative contextual information and an efficient fusion module to exploit cross-modality complementary information. The multimodal features are then fed to the triplet decoder, where a hierarchical deep supervision loss further enables the network to capture distinctive saliency cues. Second, to address the latter challenge, we propose a simple yet effective continual learning method to unify multimodal SOD. Concretely, we train the multimodal SOD tasks sequentially, applying Elastic Weight Consolidation (EWC) regularization together with the hierarchical loss function to avoid catastrophic forgetting without introducing additional parameters. Critically, the triplet decoder separates task-specific from task-invariant information, making the network easily adaptable to multimodal SOD tasks. Extensive comparisons with 26 recently proposed RGB-T and RGB-D SOD methods demonstrate the superiority of the proposed UTDNet.
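The continual-learning step above relies on the standard EWC penalty, which discourages parameters important to a previous task from drifting when training on the next one. The following is a minimal NumPy sketch of that penalty alone; the function name, the flattened-parameter representation, and the scalar weight `lam` are illustrative assumptions, not part of UTDNet's actual implementation.

```python
import numpy as np

def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Standard EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2.

    params     -- current (flattened) model parameters for the new task
    old_params -- parameters frozen after training the previous task
    fisher     -- diagonal Fisher information estimated on the previous task,
                  measuring how important each parameter was to that task
    lam        -- strength of the consolidation term (hypothetical default)
    """
    return 0.5 * lam * float(np.sum(fisher * (params - old_params) ** 2))

# Illustrative usage: a parameter with high Fisher weight contributes
# more to the penalty when it moves away from its old value.
theta_old = np.array([1.0, 1.0])
theta_new = np.array([1.0, 2.0])   # second parameter drifted by 1.0
fisher    = np.array([2.0, 2.0])
penalty = ewc_penalty(theta_new, theta_old, fisher, lam=1.0)
```

In a sequential training setup such as the one described, this penalty would be added to the task loss (here, the hierarchical deep supervision loss) so that the total objective balances new-task accuracy against retention of the previously learned task.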