Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands; Maastricht Brain Imaging Centre, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands.
Department of Data Science and Knowledge Engineering, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands.
Neural Netw. 2020 Sep;129:261-270. doi: 10.1016/j.neunet.2020.05.004. Epub 2020 May 8.
Predicting salient regions in natural images requires the detection of objects that are present in a scene. To develop robust representations for this challenging task, high-level visual features at multiple spatial scales must be extracted and augmented with contextual information. However, existing models aimed at explaining human fixation maps do not incorporate such a mechanism explicitly. Here we propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task. The architecture forms an encoder-decoder structure and includes a module with multiple convolutional layers at different dilation rates to capture multi-scale features in parallel. Moreover, we combine the resulting representations with global scene information to predict visual saliency accurately. Our model achieves competitive and consistent results across multiple evaluation metrics on two public saliency benchmarks, and we demonstrate the effectiveness of the suggested approach on five datasets and selected examples. Compared to state-of-the-art approaches, the network is based on a lightweight image classification backbone and hence presents a suitable choice for applications with limited computational resources, such as (virtual) robotic systems, to estimate human fixations across complex natural scenes. Our TensorFlow implementation is openly available at https://github.com/alexanderkroner/saliency.
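As a rough illustration of the multi-scale module described in the abstract, the following pure-Python sketch applies the same 3x3 kernel at several dilation rates in parallel and appends a global average channel as a stand-in for global scene information. It is not the authors' TensorFlow implementation: the function names, the single-channel input, the box kernel, and the rates (1, 2, 4) are all assumptions chosen for readability.

```python
# Illustrative sketch only, not the authors' TensorFlow code.
# Function names, kernel weights, and dilation rates are hypothetical.

def dilated_conv2d(image, kernel, rate):
    """Apply a 3x3 kernel whose taps are spaced `rate` pixels apart.

    Uses zero padding so the output has the same size as the input.
    As in most deep learning frameworks, the kernel is not flipped.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for ky in range(3):
                for kx in range(3):
                    sy = y + (ky - 1) * rate
                    sx = x + (kx - 1) * rate
                    if 0 <= sy < h and 0 <= sx < w:
                        total += kernel[ky][kx] * image[sy][sx]
            out[y][x] = total
    return out


def global_context(image):
    """Global average pooling, broadcast back to the input size."""
    h, w = len(image), len(image[0])
    mean = sum(sum(row) for row in image) / (h * w)
    return [[mean] * w for _ in range(h)]


def multi_scale_features(image, kernel, rates=(1, 2, 4)):
    """Return one feature map per dilation rate plus a global-context map."""
    channels = [dilated_conv2d(image, kernel, r) for r in rates]
    channels.append(global_context(image))
    return channels


if __name__ == "__main__":
    image = [[float(x + y) for x in range(8)] for y in range(8)]
    box = [[1.0 / 9.0] * 3 for _ in range(3)]
    features = multi_scale_features(image, box)
    # Three dilated branches plus one global-context channel, each 8x8.
    print(len(features), len(features[0]), len(features[0][0]))
```

Larger rates widen the receptive field without adding weights, which is why such branches can run in parallel on the same input. In a real network each branch would have its own learned kernels and many channels, and the concatenated result would be passed on to the decoder.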