IEEE Trans Image Process. 2017 Sep;26(9):4446-4456. doi: 10.1109/TIP.2017.2710620.
Understanding and predicting the human visual attention mechanism is an active area of research in neuroscience and computer vision. In this paper, we propose DeepFix, a fully convolutional neural network that models the bottom-up mechanism of visual attention via saliency prediction. Unlike classical works, which characterize the saliency map using various hand-crafted features, our model automatically learns features in a hierarchical fashion and predicts the saliency map in an end-to-end manner. DeepFix is designed to capture semantics at multiple scales while taking global context into account, by using network layers with very large receptive fields. Fully convolutional nets are generally spatially invariant, which prevents them from modeling location-dependent patterns (e.g., centre-bias). Our network handles this by incorporating a novel location-biased convolutional layer. We evaluate our model on multiple challenging saliency data sets and show that it achieves state-of-the-art results.
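The abstract does not say how the location-biased convolutional layer is built. As a minimal sketch of why an ordinary convolution cannot express centre-bias and how a location-dependent input could fix that, the example below assumes one plausible construction: fixed Gaussian centre-prior maps are appended to the feature maps as extra input channels before a standard convolution. The function names, the 3x3 kernels, and the `sigmas` parameter are all hypothetical and chosen only for illustration.

```python
import math

def gaussian_map(h, w, sigma):
    """Fixed centre-bias prior: a 2D Gaussian peaked at the image centre."""
    cy, cx = (h - 1) / 2, (w - 1) / 2
    return [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
             for x in range(w)] for y in range(h)]

def conv2d_same(channels, kernels):
    """Zero-padded 3x3 cross-correlation, summed over input channels."""
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0.0] * w for _ in range(h)]
    for ch, k in zip(channels, kernels):
        for y in range(h):
            for x in range(w):
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            out[y][x] += k[dy + 1][dx + 1] * ch[yy][xx]
    return out

def location_biased_conv(features, feat_kernels, prior_kernels, sigmas):
    """Assumed construction: append fixed Gaussian prior channels, then convolve.

    The prior channels do not depend on the image, so their kernels let the
    layer add a response that varies with position.
    """
    h, w = len(features[0]), len(features[0][0])
    priors = [gaussian_map(h, w, s) for s in sigmas]
    return conv2d_same(features + priors, feat_kernels + prior_kernels)

def impulse(h, w, y, x):
    """Feature map that is zero everywhere except a single unit activation."""
    img = [[0.0] * w for _ in range(h)]
    img[y][x] = 1.0
    return img

H = W = 9
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # passes each channel through
centre = impulse(H, W, 4, 4)   # same pattern placed at the image centre ...
corner = impulse(H, W, 1, 1)   # ... and near a corner

# Plain convolution: identical response wherever the pattern appears.
plain_centre = conv2d_same([centre], [identity])[4][4]
plain_corner = conv2d_same([corner], [identity])[1][1]

# With the prior channel, the response now depends on where the pattern is.
biased_centre = location_biased_conv([centre], [identity], [identity], [2.0])[4][4]
biased_corner = location_biased_conv([corner], [identity], [identity], [2.0])[1][1]
```

In this toy setup the plain convolution scores the pattern identically at both positions, while the variant with the prior channel scores it higher at the centre. In a trained network the kernels applied to the prior channels would be learned, so the strength and shape of the bias would be fitted to the data rather than fixed by hand.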