Disentangling bottom-up versus top-down and low-level versus high-level influences on eye movements over time.
Heiko H. Schütt, Lars O. M. Rothkegel, Hans A. Trukenbrod, Ralf Engbert, Felix A. Wichmann
Neural Information Processing Group, Universität Tübingen, Tübingen, Germany.
Experimental and Biological Psychology, University of Potsdam, Potsdam, Germany.
J Vis. 2019 Mar 1;19(3):1. doi: 10.1167/19.3.1.
Bottom-up and top-down as well as low-level and high-level factors influence where we fixate when viewing natural scenes. However, the relative importance of these factors and how they interact remain a matter of debate. Here, we disentangle these factors by analyzing their influence over time. For this purpose, we develop a saliency model based on the internal representation of a recent early spatial vision model to measure the low-level, bottom-up factor. To measure the influence of high-level, bottom-up features, we use a recent deep neural network-based saliency model. To account for top-down influences, we evaluate the models on two large data sets with different tasks: a memorization task and a search task. Our results lend support to a separation of visual scene exploration into three phases: the first saccade, an initial guided exploration characterized by a gradual broadening of the fixation density, and a steady state that is reached after roughly 10 fixations. Saccade-target selection during the initial exploration and in the steady state is related to similar areas of interest, which are better predicted when high-level features are included. In the search data set, fixation locations are determined predominantly by top-down processes. The first fixation, in contrast, follows a different fixation density and shows a strong central fixation bias. Nonetheless, first fixations are guided strongly by image properties, and as early as 200 ms after image onset, fixations are better predicted by high-level information. We conclude that low-level, bottom-up factors are largely limited to the generation of the first saccade. All saccades are better explained when high-level features are considered, and, later in viewing, this high-level, bottom-up control can be overruled by top-down influences.
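The time-resolved model evaluation described in the abstract can be illustrated with a short example. The following is a minimal sketch, not the authors' published code: it assumes fixations are stored as (image_id, fixation_index, x, y) tuples and that each saliency model provides a per-image fixation density summing to 1 over pixels; the function name log_likelihood_per_fixation_index and the data layout are hypothetical. It scores each fixation under the predicted density and averages the scores per fixation index, so that a low-level and a high-level model can be compared over the course of viewing.

    # Minimal sketch (assumptions as stated above): mean log-likelihood of
    # fixations under a saliency model's density, grouped by fixation index.
    import numpy as np

    def log_likelihood_per_fixation_index(fixations, density_maps, max_index=15):
        # fixations: iterable of (image_id, fixation_index, x, y); index starts at 1
        # density_maps: dict mapping image_id -> 2-D array summing to 1 over pixels
        scores = {i: [] for i in range(1, max_index + 1)}
        for image_id, idx, x, y in fixations:
            if not 1 <= idx <= max_index:
                continue
            p = density_maps[image_id][int(y), int(x)]  # density at fixated pixel
            scores[idx].append(np.log2(max(p, 1e-20)))  # bits; guard against log(0)
        # mean log-likelihood (bits/fixation) for each index that has data
        return {i: float(np.mean(s)) for i, s in scores.items() if s}

Evaluating two density maps per image, one from a low-level spatial-vision-based model and one from a DNN-based model, and plotting the two resulting curves against fixation index would make the pattern reported in the abstract directly readable: differences around the first saccade and a steady state after roughly 10 fixations.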