Xu Jiawei, Yue Shigang, Menchinelli Federica, Guo Kun
School of Computer Science, University of Lincoln, Lincoln, United Kingdom.
School of Psychology, University of Lincoln, Lincoln, United Kingdom.
PeerJ. 2017 Feb 1;5:e2946. doi: 10.7717/peerj.2946. eCollection 2017.
Recent research progress on the topic of human visual attention allocation in scene perception and its simulation is based mainly on studies with static images. However, natural vision requires us to extract visual information that constantly changes due to egocentric movements or dynamics of the world. It is unclear to what extent spatio-temporal regularity, an inherent regularity in dynamic vision, affects human gaze distribution and saliency computation in visual attention models. In this free-viewing eye-tracking study, we manipulated the spatio-temporal regularity of traffic videos by presenting them in normal video sequence, reversed video sequence, normal frame sequence, and randomised frame sequence. The recorded human gaze allocation was then used as the 'ground truth' to examine the predictive ability of a number of state-of-the-art visual attention models. The analysis revealed high inter-observer agreement across individual human observers, but all the tested attention models performed significantly worse than humans. The models' inferior predictive ability was evident in gaze predictions that were indistinguishable across stimulus presentation sequences and in a weak central fixation bias. Our findings suggest that a realistic visual attention model for the processing of dynamic scenes should incorporate human visual sensitivity to spatio-temporal regularity and central fixation bias.