Salehin Md Musfequs, Paul Manoranjan
J Opt Soc Am A Opt Image Sci Vis. 2017 May 1;34(5):814-826. doi: 10.1364/JOSAA.34.000814.
Surveillance cameras capture large volumes of continuous video every day. Manually searching this footage for significant events is laborious and tedious. Existing summarization approaches sometimes miss key frames with significant visual content, select unimportant frames with little or no activity, or both. To address this problem, this paper proposes a video summarization technique that combines three multimodal features to which human vision is sensitive: foreground objects, motion information, and visual saliency. Foreground objects are among the most important parts of a video stream because they carry more detailed information and play a major role in important events. Motion is another stimulus that strongly attracts human visual attention. To capture it, motion information is calculated in both the spatial domain and the frequency domain. Spatial motion information captures object motion accurately but is sensitive to illumination changes; frequency-domain motion information is robust to illumination changes but is easily affected by noise. Because the two are complementary, motion information from both domains is employed. Furthermore, the visual attention cue is a sensitive indicator of how strongly a frame attracts a viewer, which helps determine key frames. Because none of these features performs well on its own, an adaptive linear weighted fusion scheme is proposed to combine them and rank video frames for summarization. Experimental results show that the proposed method outperforms state-of-the-art methods.
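The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the adaptive weights are computed, so the variance-based weighting below is an assumption made purely for illustration, as are the function names and the toy feature scores; it is not the authors' scheme.

```python
from typing import Dict, List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def variance(values: List[float]) -> float:
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def fuse_and_rank(features: Dict[str, List[float]], top_k: int) -> List[int]:
    """Return the indices of the top_k frames by fused score, in temporal order."""
    normalized = {name: min_max_normalize(vals) for name, vals in features.items()}

    # ASSUMPTION: a feature that varies more across frames is treated as more
    # discriminative and receives a larger weight. The paper's actual
    # adaptive weighting rule is not given in the abstract.
    spreads = {name: variance(vals) for name, vals in normalized.items()}
    total = sum(spreads.values())
    count = len(normalized)
    weights = {
        name: (spreads[name] / total if total > 0 else 1.0 / count)
        for name in normalized
    }

    # Linear weighted fusion: one score per frame.
    n_frames = len(next(iter(normalized.values())))
    fused = [
        sum(weights[name] * normalized[name][i] for name in normalized)
        for i in range(n_frames)
    ]

    # Rank frames by fused score, keep the best top_k, restore temporal order.
    ranked = sorted(range(n_frames), key=lambda i: fused[i], reverse=True)
    return sorted(ranked[:top_k])


# Hypothetical per-frame scores for a five-frame clip.
features = {
    "foreground": [0.0, 0.1, 0.9, 0.8, 0.0],
    "spatial_motion": [0.0, 0.2, 0.7, 0.9, 0.1],
    "frequency_motion": [0.1, 0.1, 0.8, 0.7, 0.1],
    "saliency": [0.2, 0.3, 0.6, 0.9, 0.2],
}
key_frames = fuse_and_rank(features, top_k=2)
print(key_frames)
```

Normalizing each feature before fusion keeps a feature with a large numeric range from dominating the weighted sum. In practice the per-frame scores would come from a foreground detector, motion estimators in each domain, and a saliency model.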