IEEE Trans Image Process. 2018 May;27(5):2368-2378. doi: 10.1109/TIP.2017.2787612. Epub 2017 Dec 27.
In this paper, we aim to predict human eye fixations in free-viewing scenes with an end-to-end deep learning architecture. Although convolutional neural networks (CNNs) have substantially improved human attention prediction, CNN-based attention models still need to leverage multi-scale features more efficiently. Our visual attention network is designed to capture hierarchical saliency information, from deep, coarse layers carrying global saliency information to shallow, fine layers carrying local saliency responses. Our model is based on a skip-layer network structure, which predicts human attention from multiple convolutional layers with different receptive fields. The final saliency prediction is obtained by combining these global and local predictions. Our model is trained with deep supervision, in which supervision is fed directly into layers at multiple levels, unlike previous approaches that provide supervision only at the output layer and propagate it back to earlier layers. Our model thus incorporates multi-level saliency predictions within a single network, which significantly reduces the redundancy of previous approaches that learn multiple network streams with different input scales. Extensive experimental analysis on various challenging benchmark datasets demonstrates that our method yields state-of-the-art performance with competitive inference time.
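The deep supervision idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not name the loss, the upsampling method, or the fusion rule, so binary cross-entropy, nearest-neighbour upsampling, and a fixed weighted average are all assumptions here, and every function name is hypothetical. The sketch only shows the structure: each side output from a different depth is compared with the ground truth directly, and the fused prediction adds one more loss term.

```python
import math

def upsample_nearest(sal_map, out_h, out_w):
    """Resize a 2-D map (list of rows) to out_h x out_w by nearest-neighbour lookup."""
    in_h, in_w = len(sal_map), len(sal_map[0])
    return [[sal_map[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two equally sized 2-D maps (assumed loss)."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count

def fuse(maps, weights):
    """Weighted average of same-sized maps; a stand-in for a learned fusion layer."""
    norm = sum(weights)
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(wt * m[r][c] for wt, m in zip(weights, maps)) / norm
             for c in range(w)] for r in range(h)]

def deeply_supervised_loss(side_preds, target, fusion_weights):
    """Sum one loss per side output plus one loss on the fused prediction."""
    h, w = len(target), len(target[0])
    upsampled = [upsample_nearest(p, h, w) for p in side_preds]
    side_losses = [bce(p, target) for p in upsampled]  # supervision at every level
    fused = fuse(upsampled, fusion_weights)
    total = sum(side_losses) + bce(fused, target)
    return total, side_losses, fused

# Toy example: a 4x4 fixation map and three side outputs of decreasing resolution.
target = [[0.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 1.0, 0.0],
          [0.0, 1.0, 1.0, 0.0],
          [0.0, 0.0, 0.0, 0.0]]
coarse = [[0.25]]                      # deep layer: global, low resolution
middle = [[0.3, 0.3], [0.3, 0.3]]      # intermediate layer
fine = [[0.1, 0.1, 0.1, 0.1],
        [0.1, 0.8, 0.8, 0.1],
        [0.1, 0.8, 0.8, 0.1],
        [0.1, 0.1, 0.1, 0.1]]          # shallow layer: local, full resolution

total, side_losses, fused = deeply_supervised_loss(
    [coarse, middle, fine], target, fusion_weights=[1.0, 1.0, 2.0])
print("per-level losses:", side_losses)
print("total loss:", total)
```

In a trained network the fusion weights would be learned and the side outputs would come from convolutional layers. The sketch fixes them by hand only to keep the loss structure visible.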