IEEE Trans Image Process. 2021;30:4238-4252. doi: 10.1109/TIP.2021.3068649. Epub 2021 Apr 12.
Human attention is an interactive activity between our visual system and our brain, using both low-level visual stimuli and high-level semantic information. Previous image salient object detection (SOD) studies conduct their saliency predictions via a multitask methodology in which pixelwise saliency regression and segmentation-like saliency refinement are conducted simultaneously. However, this multitask methodology has one critical limitation: the semantic information embedded in the feature backbones might degenerate during training. Our visual attention is determined mainly by semantic information, as evidenced by our tendency to pay more attention to semantically salient regions even when these regions are not the most perceptually salient at first glance. This fact clearly contradicts the widely used multitask methodology mentioned above. To address this issue, this paper divides the SOD problem into two sequential steps. First, we devise a lightweight, weakly supervised deep network to coarsely locate the semantically salient regions. Next, as a postprocessing refinement, we selectively fuse multiple off-the-shelf deep models on the semantically salient regions identified by the previous step to formulate a pixelwise saliency map. Compared with the state-of-the-art (SOTA) models that focus on learning the pixelwise saliency in single images using only perceptual clues, our method aims at investigating the object-level semantic ranks between multiple images, a methodology more consistent with the human attention mechanism. Our method is simple yet effective, and it is the first attempt to consider salient object detection as mainly an object-level semantic reranking problem.
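The two sequential steps described above can be sketched as a minimal pipeline. This is an illustrative assumption, not the paper's implementation: `locate_semantic_regions` stands in for the weakly supervised localization network, and the dummy predictors stand in for the off-the-shelf deep saliency models that are selectively fused inside the coarse semantic mask.

```python
import numpy as np

rng = np.random.default_rng(0)

def locate_semantic_regions(image, threshold=0.5):
    """Step 1 (hypothetical stand-in): the paper's lightweight, weakly
    supervised network would coarsely score semantic saliency; a random
    score map stands in for that network's output here."""
    coarse = rng.random(image.shape[:2])
    return coarse > threshold  # coarse binary mask of salient regions

def fuse_saliency(image, mask, predictors):
    """Step 2 (sketch): average the pixelwise maps produced by several
    off-the-shelf saliency models, then keep only the responses that
    fall inside the coarse semantic mask from step 1."""
    maps = [p(image) for p in predictors]
    fused = np.mean(maps, axis=0)
    return fused * mask  # suppress saliency outside semantic regions

# Usage with two dummy "off-the-shelf" predictors (assumptions):
image = rng.random((8, 8, 3))
predictors = [lambda im: im.mean(axis=2), lambda im: im.max(axis=2)]
mask = locate_semantic_regions(image)
saliency = fuse_saliency(image, mask, predictors)
```

The key design point the sketch mirrors is the sequencing: semantic localization runs first, and pixelwise refinement is restricted to the regions it identifies, rather than training both objectives jointly.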