Coutrot Antoine, Guyader Nathalie
Gipsa-lab, CNRS, & Grenoble-Alpes University, Grenoble, France.
J Vis. 2014 Jul 3;14(8):5. doi: 10.1167/14.8.5.
Conversation scenes are a typical example in which classical models of visual attention dramatically fail to predict eye positions. Indeed, these models rarely consider faces as particular gaze attractors and never take into account the important auditory information that always accompanies dynamic social scenes. We recorded the eye movements of participants viewing dynamic conversations taking place in various contexts. Conversations were seen either with their original soundtracks or with unrelated soundtracks (unrelated speech and abrupt or continuous natural sounds). First, we analyze how auditory conditions influence participants' eye movement parameters. Then, we model the probability distribution of eye positions across each video frame with a statistical method (Expectation-Maximization), allowing us to quantify the relative contribution of different visual features: static low-level visual saliency (based on luminance contrast), dynamic low-level visual saliency (based on motion amplitude), faces, and center bias. Through experimental and modeling results, we show that regardless of the auditory condition, participants look more at faces, and especially at talking faces. Hearing the original soundtrack makes participants follow the speech turn-taking more closely. However, we do not find any difference between the different types of unrelated soundtracks. These eye-tracking results are confirmed by our model, which shows that faces, and particularly talking faces, are the features that best explain the recorded gaze positions, especially in the original soundtrack condition. Low-level saliency is not a relevant feature for explaining eye positions in social scenes, even dynamic ones. Finally, we propose groundwork for an audiovisual saliency model.
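As a rough illustration of the modeling step described in the abstract, the sketch below shows how Expectation-Maximization can estimate the relative contribution (mixture weight) of each candidate feature map (static saliency, dynamic saliency, faces, center bias) to a set of recorded gaze positions. This is a minimal sketch under the assumption that gaze positions are modeled as samples from a weighted mixture of fixed, normalized feature maps; it is not the authors' exact implementation, and all function and variable names are hypothetical.

```python
import numpy as np

def em_mixture_weights(feature_maps, gaze_points, n_iter=100, tol=1e-6):
    """
    Estimate the mixture weight of each feature map given recorded gaze positions.

    feature_maps : list of K 2D arrays (same shape), each normalized to sum to 1,
                   e.g. [static_saliency, dynamic_saliency, face_map, center_bias]
    gaze_points  : (N, 2) integer array of (row, col) eye positions for one frame
    Returns a length-K weight vector summing to 1.
    """
    K = len(feature_maps)
    # Likelihood of each gaze point under each feature map: shape (N, K)
    lik = np.stack(
        [fm[gaze_points[:, 0], gaze_points[:, 1]] for fm in feature_maps],
        axis=1,
    ) + 1e-12  # avoid zero likelihoods

    w = np.full(K, 1.0 / K)  # start from uniform weights
    for _ in range(n_iter):
        # E-step: responsibility of each feature map for each gaze point
        resp = w * lik
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: updated weights are the mean responsibilities
        w_new = resp.mean(axis=0)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w
```

In this framing, a large weight on the face map relative to the low-level saliency maps would correspond to the paper's finding that faces, rather than luminance contrast or motion amplitude, best explain where observers look in conversation scenes.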