Mathe S, Sminchisescu C. Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition. IEEE Trans Pattern Anal Mach Intell. 2015 Jul;37(7):1408-24. doi: 10.1109/TPAMI.2014.2366154.
Systems based on bag-of-words models, built from image features collected at the maxima of sparse interest point operators, have been used successfully for both visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in 'saccade and fixate' regimes, the methodology and emphasis in the human and computer vision communities remain sharply distinct. Here, we make three contributions aimed at bridging this gap. First, we complement existing state-of-the-art, large-scale, annotated dynamic computer vision datasets like Hollywood-2 [1] and UCF Sports [2] with human eye movements collected under the ecological constraints of visual action and scene context recognition tasks. To our knowledge, these are the first large human eye tracking datasets for video to be collected and made publicly available (vision.imar.ro/eyetracking; 497,107 frames, each viewed by 19 subjects), unique in terms of their (a) large scale and computer vision relevance, (b) dynamic, video stimuli, and (c) task control as well as free viewing. Second, we introduce novel dynamic consistency and alignment measures, which underline the remarkable stability of patterns of visual search among subjects. Third, we leverage the significant amount of collected data to pursue studies and build automatic, end-to-end trainable computer vision systems based on human eye movements. Our studies not only shed light on the differences between computer vision spatio-temporal interest point sampling strategies and human fixations, and on their impact on visual recognition performance, but also demonstrate that human fixations can be accurately predicted and that, when used in an end-to-end automatic system that leverages advanced computer vision practice, they can lead to state-of-the-art results.
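As a concrete illustration of the recognition pipeline the abstract opens with, the following is a minimal sketch of bag-of-words encoding over descriptors extracted at sparse interest points. This is not the authors' implementation: the synthetic random features stand in for real spatio-temporal descriptors (e.g., gradient/flow histograms around detected points), and the vocabulary size is an arbitrary assumption.

```python
# Minimal bag-of-words sketch over interest-point descriptors.
# Synthetic data replaces real spatio-temporal descriptors (assumption).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for descriptors pooled from many training videos (n_points x dim).
train_descriptors = rng.normal(size=(5000, 64))

# Build a visual vocabulary by clustering the pooled descriptors.
n_words = 100  # assumed vocabulary size
vocab = KMeans(n_clusters=n_words, n_init=4, random_state=0).fit(train_descriptors)

def bow_histogram(descriptors, vocab):
    """Quantize each descriptor to its nearest visual word and return an
    L1-normalized word-count histogram describing the whole video."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Encode one "video" worth of interest-point descriptors.
video_descriptors = rng.normal(size=(300, 64))
h = bow_histogram(video_descriptors, vocab)
print(h.shape, h.sum())  # (100,) 1.0
```

The resulting histograms are what a classifier (e.g., an SVM) would consume for object or action recognition in pipelines of this family.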
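The paper's dynamic consistency and alignment measures are defined in its body; as rough intuition, a generic way to quantify inter-subject fixation agreement is a leave-one-subject-out AUC against a blurred fixation density map built from the remaining subjects. The sketch below implements that generic measure, not the paper's exact metric; the synthetic fixations, blur width, and uniform negative sampling are all assumptions for illustration.

```python
# Generic leave-one-subject-out fixation consistency (assumption, not the
# paper's exact metric): score each subject's fixations against a
# Gaussian-blurred density map of the other subjects' fixations via AUC.
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
H, W, n_subjects = 120, 160, 19  # frame size and subject count (19 as in the dataset)

# Synthetic fixations: each subject fixates near a shared region of interest.
fixations = [np.clip(rng.normal([60, 80], 10, size=(5, 2)).astype(int),
                     0, [H - 1, W - 1])
             for _ in range(n_subjects)]

def consistency_auc(fixations, held_out, sigma=8.0, n_neg=200):
    """Blur the fixations of all subjects except `held_out` into a density
    map, then measure how well that map separates the held-out subject's
    fixated pixels from uniformly sampled pixels (ROC AUC)."""
    density = np.zeros((H, W))
    for s, pts in enumerate(fixations):
        if s == held_out:
            continue
        np.add.at(density, (pts[:, 0], pts[:, 1]), 1.0)
    density = gaussian_filter(density, sigma)
    pos = density[fixations[held_out][:, 0], fixations[held_out][:, 1]]
    neg = density[rng.integers(0, H, n_neg), rng.integers(0, W, n_neg)]
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return roc_auc_score(labels, np.concatenate([pos, neg]))

# Average agreement across subjects; values near 1.0 indicate that
# subjects fixate highly consistent locations.
print(np.mean([consistency_auc(fixations, s) for s in range(n_subjects)]))
```

Applied per frame and aggregated over time, a measure of this kind captures the inter-subject stability of visual search that the abstract highlights.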