Xu Mai, Liu Yufan, Hu Haoji, He Feng
IEEE Trans Image Process. 2018 Sep;27(9):4529-4544. doi: 10.1109/TIP.2018.2837106. Epub 2018 May 16.
The past decade has witnessed the use of high-level features in saliency prediction for both videos and images. Unfortunately, existing saliency prediction methods handle only high-level static features, such as faces. In fact, high-level dynamic features (also called actions), such as speaking or head turning, are also extremely attractive to visual attention in videos. Thus, in this paper, we propose a data-driven method for learning to predict the saliency of multiple-face videos by leveraging both static and dynamic features at a high level. Specifically, we introduce an eye-tracking database collecting the fixations of 39 subjects viewing 65 multiple-face videos. Through analysis of our database, we find a set of high-level features that cause a face to receive extensive visual attention. These high-level features include the static features of face size, center bias, and head pose, as well as the dynamic features of speaking and head turning. Then, we present techniques for extracting these high-level features. Afterwards, a novel model, namely the multiple hidden Markov model (M-HMM), is developed in our method to enable the transition of saliency among faces. In our M-HMM, the saliency transition takes into account both the state of saliency at previous frames and the observed high-level features at the current frame. The experimental results show that the proposed method is superior to other state-of-the-art methods in predicting visual attention on multiple-face videos. Finally, we shed light on a promising application of our saliency prediction method in locating the region-of-interest (ROI) for video conference compression with high efficiency video coding (HEVC).
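The core M-HMM idea described in the abstract — saliency shifting among faces as a function of both the previous frame's saliency state and the high-level features observed at the current frame — can be illustrated with a minimal HMM-style forward recursion. This is only a sketch under stated assumptions: the two-face setup, the feature-derived emission scores, and the transition matrix below are all illustrative values, not the paper's actual learned model.

```python
import numpy as np

def face_saliency_forward(emissions, transition, prior):
    """Forward-style recursion treating each face as a hidden state.

    emissions[t, f]  -- attractiveness of face f at frame t, derived from
                        observed high-level features (size, speaking, ...)
    transition[i, j] -- probability that attention moves from face i to face j
    prior[f]         -- initial saliency distribution over faces
    Returns a (T, F) array of per-frame saliency distributions over faces.
    """
    T, F = emissions.shape
    saliency = np.zeros((T, F))
    # Frame 0: combine the prior with the first frame's feature evidence.
    saliency[0] = prior * emissions[0]
    saliency[0] /= saliency[0].sum()
    for t in range(1, T):
        # Propagate the previous saliency state through the transition
        # model, then reweight by the current frame's observed features.
        saliency[t] = (saliency[t - 1] @ transition) * emissions[t]
        saliency[t] /= saliency[t].sum()
    return saliency

# Illustrative toy example: 2 faces over 3 frames.
emissions = np.array([
    [0.8, 0.2],   # frame 0: face 0 is speaking
    [0.7, 0.3],   # frame 1: face 0 still dominant
    [0.2, 0.8],   # frame 2: attention-drawing action shifts to face 1
])
transition = np.array([[0.9, 0.1],
                       [0.1, 0.9]])   # gaze tends to linger on the same face
prior = np.array([0.5, 0.5])

saliency = face_saliency_forward(emissions, transition, prior)
```

Note how the transition matrix gives the recursion inertia: at frame 2 the feature evidence strongly favors face 1, but the accumulated saliency on face 0 makes the shift gradual rather than instantaneous, which is the behavior the abstract attributes to combining previous-frame state with current-frame observations.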