IEEE Trans Image Process. 2019 Jul;28(7):3502-3515. doi: 10.1109/TIP.2019.2897966. Epub 2019 Feb 7.
Visual attention is a dynamic process of scene exploration and information acquisition. However, existing research on attention modeling has concentrated on estimating static salient locations; the dynamic attributes exhibited by saccades have not been well explored in previous attention models. In this paper, we address the problem of saccadic scanpath prediction by introducing an iterative representation learning framework. Within this framework, a saccade is interpreted as an iterative process of predicting one fixation according to the current representation and then updating the representation based on the gaze shift. In the prediction phase, we propose a Bayesian definition of saccades that combines the influence of perceptual residual and spatial location on the selection of fixations. In the implementation, we compute the reconstruction error of an autoencoder-based network to measure the perceptual residual of each area. Simultaneously, we integrate saccade amplitude and a center-weighted mechanism to model the influence of spatial location. By combining the estimated influence of these two parts, the next fixation is defined as the point with the largest posterior probability of gaze shift. In the updating phase, we update the representation for subsequent calculations by retraining the network with samples extracted around the current fixation. In the experiments, the proposed model replicates fundamental psychophysical properties of visual search and achieves superior performance on several benchmark eye-tracking data sets.
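The predict-then-update loop described in the abstract can be sketched in code. The following is a minimal, hypothetical illustration, not the authors' implementation: it stands in for the autoencoder with a linear PCA basis (so reconstruction error plays the role of the perceptual residual), models the spatial term as a Gaussian saccade-amplitude prior multiplied by a center-bias Gaussian, selects each fixation as the argmax of residual x prior (the posterior up to normalization), and "retrains" by refitting the basis on patches sampled around the new fixation. All function names, patch sizes, and sigma values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_basis(patches, k=8):
    """Stand-in for autoencoder training: learn a top-k PCA basis."""
    mean = patches.mean(axis=0)
    # top-k principal directions of the centered patches via SVD
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, vt[:k]

def residual_map(image, basis, mean, p=8):
    """Reconstruction error of each p x p patch = perceptual residual."""
    h, w = image.shape
    res = np.zeros((h - p + 1, w - p + 1))
    for i in range(res.shape[0]):
        for j in range(res.shape[1]):
            patch = image[i:i+p, j:j+p].ravel() - mean
            recon = basis.T @ (basis @ patch)  # project and reconstruct
            res[i, j] = np.sum((patch - recon) ** 2)
    return res

def spatial_prior(shape, fix, sigma_amp=20.0, sigma_center=40.0):
    """Saccade-amplitude prior around the current fixation x center bias."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    amp = np.exp(-((ys - fix[0])**2 + (xs - fix[1])**2) / (2 * sigma_amp**2))
    cy, cx = shape[0] / 2, shape[1] / 2
    center = np.exp(-((ys - cy)**2 + (xs - cx)**2) / (2 * sigma_center**2))
    return amp * center

def predict_scanpath(image, n_fix=5, p=8, k=8):
    """Iterate: predict fixation from current representation, then update it."""
    # initial representation learned from randomly sampled patches
    idx = rng.integers(0, image.shape[0] - p, size=(200, 2))
    patches = np.stack([image[i:i+p, j:j+p].ravel() for i, j in idx])
    mean, basis = fit_basis(patches, k)
    fix = (image.shape[0] // 2, image.shape[1] // 2)  # start at the center
    path = [fix]
    for _ in range(n_fix - 1):
        # prediction phase: posterior ~ perceptual residual x spatial prior
        res = residual_map(image, basis, mean, p)
        post = res * spatial_prior(res.shape, fix)
        fix = np.unravel_index(post.argmax(), post.shape)
        path.append(fix)
        # updating phase: retrain on patches sampled around the new fixation
        ni = np.clip(rng.normal(fix[0], p, 50).astype(int), 0, image.shape[0] - p)
        nj = np.clip(rng.normal(fix[1], p, 50).astype(int), 0, image.shape[1] - p)
        new = np.stack([image[i:i+p, j:j+p].ravel() for i, j in zip(ni, nj)])
        patches = np.vstack([patches, new])
        mean, basis = fit_basis(patches, k)
    return path

path = predict_scanpath(rng.random((64, 64)))
```

Because the residual is recomputed after each refit, areas already "explained" by the updated representation score lower, which pushes successive fixations toward unexplored regions, mirroring the inhibition-of-return-like behavior the iterative framework is designed to capture.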