IEEE Trans Pattern Anal Mach Intell. 2021 Jan;43(1):220-237. doi: 10.1109/TPAMI.2019.2924417. Epub 2020 Dec 4.
Predicting where people look in static scenes, a.k.a. visual saliency, has received significant research interest recently. However, relatively little effort has been devoted to understanding and modeling visual attention over dynamic scenes. This work makes three contributions to video saliency research. First, we introduce a new benchmark, called DHF1K (Dynamic Human Fixation 1K), for predicting fixations during free-viewing of dynamic scenes, a long-standing need in this field. DHF1K consists of 1K high-quality, carefully selected video sequences annotated by 17 observers using an eye tracker. The videos span a wide range of scenes, motions, object types, and backgrounds. Second, we propose a novel video saliency model, called ACLNet (Attentive CNN-LSTM Network), which augments the CNN-LSTM architecture with a supervised attention mechanism to enable fast end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. This design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. Third, we perform an extensive evaluation of state-of-the-art saliency models on three datasets: DHF1K, Hollywood-2, and UCF Sports. An attribute-based analysis of previous saliency models and a cross-dataset generalization study are also presented. Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that ACLNet outperforms the other contenders and achieves a fast processing speed (40 fps on a single GPU). Our code and all results are available at https://github.com/wenguanwang/DHF1K.
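To make the described design concrete, below is a minimal PyTorch sketch of the general idea of a supervised attention branch that predicts a static-saliency-like map, reweights CNN features with it, and feeds the result to a convolutional LSTM for per-frame saliency. This is an illustration under assumed channel sizes, a stub backbone, and hypothetical module names, not the authors' released implementation (see the linked repository for that).

```python
# Illustrative sketch of an attention-augmented CNN-LSTM for video saliency.
# All sizes and the backbone are assumptions; the real ACLNet differs in detail.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """A basic convolutional LSTM cell (hypothetical channel sizes)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces all four gate pre-activations.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class AttentiveCNNLSTM(nn.Module):
    def __init__(self, feat_ch=512, hid_ch=256):
        super().__init__()
        # Stub CNN standing in for a pretrained backbone feature extractor.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Attention branch: predicts a static attention map that can be
        # supervised with static fixation data and used to reweight features.
        self.attention = nn.Sequential(
            nn.Conv2d(feat_ch, 64, 1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.Sigmoid(),
        )
        self.convlstm = ConvLSTMCell(feat_ch, hid_ch)
        self.readout = nn.Conv2d(hid_ch, 1, 1)  # per-frame saliency map

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        h = c = None
        sal_maps, att_maps = [], []
        for step in range(t):
            feat = self.cnn(frames[:, step])
            att = self.attention(feat)          # static attention map
            feat = feat + feat * att            # residual attention reweighting
            if h is None:
                h = feat.new_zeros(b, self.convlstm.hid_ch, *feat.shape[-2:])
                c = h.clone()
            h, c = self.convlstm(feat, (h, c))  # temporal saliency dynamics
            sal_maps.append(torch.sigmoid(self.readout(h)))
            att_maps.append(att)
        return torch.stack(sal_maps, dim=1), torch.stack(att_maps, dim=1)

# Example usage: two clips of eight 224x224 frames.
model = AttentiveCNNLSTM()
saliency, attention = model(torch.randn(2, 8, 3, 224, 224))
```

Returning the intermediate attention maps alongside the saliency predictions mirrors the paper's training idea: the attention branch can receive its own loss against static fixation data, while the recurrent head is trained on video fixations.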