IEEE Trans Pattern Anal Mach Intell. 2018 Jun;40(6):1382-1396. doi: 10.1109/TPAMI.2017.2713785. Epub 2017 Jun 8.
Pixel-level annotations are expensive and time-consuming to obtain. Hence, weak supervision using only image tags could have a significant impact on semantic segmentation. Recently, CNN-based methods have proposed to fine-tune pre-trained networks using image tags. Without additional information, this leads to poor localization accuracy. This problem, however, was alleviated by making use of objectness priors to generate foreground/background masks. Unfortunately, these priors either require pixel-level annotations/bounding boxes, or still yield inaccurate object boundaries. Here, we propose a novel method to extract accurate masks from networks pre-trained for the task of object recognition, thus forgoing external objectness modules. We first show how foreground/background masks can be obtained from the activations of higher-level convolutional layers of a network. We then show how to obtain multi-class masks by fusing foreground/background masks with information extracted from a weakly-supervised localization network. Our experiments demonstrate that exploiting these masks in conjunction with a weakly-supervised training loss yields state-of-the-art tag-based weakly-supervised semantic segmentation results.
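The two-step idea above can be sketched as follows. This is a hypothetical illustration, not the authors' exact procedure: it aggregates the channels of a high-level convolutional feature map and thresholds the result to obtain a foreground/background mask, then assigns each foreground pixel the class with the strongest response in a set of per-class localization heatmaps (both the threshold and the aggregation scheme are assumptions for the sketch).

```python
import numpy as np

def foreground_mask(activations, threshold_ratio=0.5):
    """Hypothetical sketch: activations is a (C, H, W) feature map from a
    high-level conv layer; aggregate channel evidence and threshold it."""
    heat = activations.sum(axis=0)                       # combine channels
    rng = heat.max() - heat.min()
    heat = (heat - heat.min()) / (rng + 1e-8)            # normalize to [0, 1]
    return heat >= threshold_ratio                       # boolean fg mask

def multiclass_mask(fg_mask, class_heatmaps):
    """Fuse the fg/bg mask with (K, H, W) per-class localization heatmaps:
    each foreground pixel gets the argmax class (1..K), background gets 0."""
    labels = class_heatmaps.argmax(axis=0) + 1
    return np.where(fg_mask, labels, 0)

# Toy example: a feature map with a strong 4x4 central response,
# and two class heatmaps where class 2 dominates.
fmap = np.zeros((4, 8, 8))
fmap[:, 2:6, 2:6] = 1.0
fg = foreground_mask(fmap)

cls_maps = np.stack([np.zeros((8, 8)), np.ones((8, 8))])
seg = multiclass_mask(fg, cls_maps)
print(fg.sum())            # 16 pixels marked foreground
print((seg == 2).sum())    # those 16 pixels labeled as class 2
```

In the paper itself, the multi-class cue comes from a weakly-supervised localization network trained with image tags only; the argmax fusion here is just a minimal stand-in for that fusion step.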