IEEE Trans Pattern Anal Mach Intell. 2017 Mar;39(3):576-588. doi: 10.1109/TPAMI.2016.2547384. Epub 2016 Mar 28.
Top-down visual saliency is an important module of visual attention. In this work, we propose a novel top-down saliency model that jointly learns a Conditional Random Field (CRF) and a visual dictionary. The proposed model incorporates a layered structure from top to bottom: CRF, sparse coding, and image patches. With sparse coding as an intermediate layer, the CRF is learned in a feature-adaptive manner; meanwhile, with the CRF as the output layer, the dictionary is learned under structured supervision. For efficient and effective joint learning, we develop a max-margin approach via a stochastic gradient descent algorithm. Experimental results on the Graz-02 and PASCAL VOC datasets show that our model performs favorably against state-of-the-art top-down saliency methods for target object localization. In addition, the dictionary update significantly improves the performance of our model. We demonstrate the merits of the proposed top-down saliency model by applying it to prioritizing object proposals for detection and to predicting human fixations.
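To make the layered structure concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of the bottom two layers: an image patch is sparse-coded over a dictionary, and the resulting code is scored by a linear weight vector standing in for a CRF unary potential. All sizes, the ISTA solver, and the weight vector `w` are assumptions for illustration; in the paper, the dictionary and CRF weights are learned jointly by max-margin SGD rather than fixed at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 64-dimensional patch features, 128 dictionary atoms.
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
w = rng.standard_normal(128)            # stand-in for learned CRF unary weights

def sparse_code(x, D, lam=0.1, steps=100):
    """ISTA for argmin_s 0.5*||x - D s||^2 + lam*||s||_1,
    a generic stand-in for the paper's sparse-coding layer."""
    L = np.linalg.norm(D, 2) ** 2       # Lipschitz constant of the smooth part
    s = np.zeros(D.shape[1])
    for _ in range(steps):
        g = D.T @ (D @ s - x)           # gradient of 0.5*||x - D s||^2
        s = s - g / L
        # soft-thresholding (proximal step for the L1 penalty)
        s = np.sign(s) * np.maximum(np.abs(s) - lam / L, 0.0)
    return s

x = rng.standard_normal(64)             # one image patch (feature vector)
s = sparse_code(x, D)                   # intermediate layer: sparse code
saliency = w @ s                        # top layer: unary saliency score
```

Because the sparse code depends on the dictionary, updating the dictionary under CRF supervision (as the paper does) changes the features the CRF sees, which is the "feature-adaptive" coupling the abstract describes.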