

A Multimodal Saliency Model for Videos with High Audio-Visual Correspondence.

Author Information

Min Xiongkuo, Zhai Guangtao, Zhou Jiantao, Zhang Xiao-Ping, Yang Xiaokang, Guan Xinping

Publication Information

IEEE Trans Image Process. 2020 Jan 17. doi: 10.1109/TIP.2020.2966082.

Abstract

Audio information has been bypassed by most current visual attention prediction studies. However, sound can influence visual attention, and this influence has been widely investigated and verified by many psychological studies. In this paper, we propose a novel multi-modal saliency (MMS) model for videos containing scenes with high audio-visual correspondence. In such scenes, humans tend to be attracted to the sound sources, and it is also possible to localize the sound sources via cross-modal analysis. Specifically, we first detect the spatial and temporal saliency maps from the visual modality using a novel free energy principle. Then we propose to detect the audio saliency map from both the audio and visual modalities by localizing the moving-sounding objects using cross-modal kernel canonical correlation analysis, which is the first of its kind in the literature. Finally, we propose a new two-stage adaptive audio-visual saliency fusion method to integrate the spatial, temporal, and audio saliency maps into our audio-visual saliency map. The proposed MMS model captures the influence of audio, which is not considered in the latest deep-learning-based saliency models. To take advantage of both deep saliency modeling and audio-visual saliency modeling, we propose to combine deep saliency models and the MMS model via late fusion, and we find that an average performance gain of 5% is obtained. Experimental results on audio-visual attention databases show that the introduced models incorporating audio cues are significantly superior to state-of-the-art image and video saliency models that utilize only the visual modality.
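
To make the fusion idea concrete, the sketch below shows a generic two-stage combination of spatial, temporal, and audio saliency maps into one audio-visual map. The normalization and the peakiness-based adaptive weight are illustrative assumptions for demonstration only; they are not the exact fusion rule of the MMS model, which the abstract does not specify.

```python
# Illustrative sketch only: a generic two-stage fusion of spatial, temporal,
# and audio saliency maps. The weighting scheme (max-min normalization plus
# a "peakiness" confidence weight) is an assumption, not the paper's method.
import numpy as np

def normalize(sal_map):
    """Scale a saliency map to the [0, 1] range."""
    lo, hi = sal_map.min(), sal_map.max()
    if hi - lo < 1e-12:
        return np.zeros_like(sal_map)
    return (sal_map - lo) / (hi - lo)

def peakiness(sal_map):
    """Heuristic confidence: how concentrated the map is (max vs. mean)."""
    return float(sal_map.max() - sal_map.mean())

def fuse_two_stage(spatial, temporal, audio):
    """Stage 1: fuse the visual spatial and temporal maps.
    Stage 2: adaptively blend in the audio saliency map."""
    spatial, temporal, audio = map(normalize, (spatial, temporal, audio))

    # Stage 1: visual spatio-temporal saliency (equal weights assumed).
    visual = normalize(0.5 * spatial + 0.5 * temporal)

    # Stage 2: weight the audio map by its peakiness, so a diffuse
    # (uninformative) audio localization map contributes less.
    w_audio = peakiness(audio) / (peakiness(audio) + peakiness(visual) + 1e-12)
    return normalize((1.0 - w_audio) * visual + w_audio * audio)

# Example with random maps standing in for the maps of one video frame.
rng = np.random.default_rng(0)
h, w = 36, 64
av_saliency = fuse_two_stage(rng.random((h, w)),
                             rng.random((h, w)),
                             rng.random((h, w)))
print(av_saliency.shape, float(av_saliency.min()), float(av_saliency.max()))
```

The same adaptive-weight pattern could in principle be reused for the late fusion with a deep saliency model mentioned in the abstract, treating the deep map as one more normalized input; again, that is a reading of the abstract rather than the authors' stated formula.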

