Mademlis Ioannis, Tefas Anastasios, Nikolaidis Nikos, Pitas Ioannis
IEEE Trans Image Process. 2016 Dec;25(12):5828-5840. doi: 10.1109/TIP.2016.2615289. Epub 2016 Oct 5.
Video summarization is a timely and rapidly developing research field with broad commercial interest, due to the increasing availability of massive video data. Relevant algorithms face the challenge of achieving a careful balance between summary compactness, enjoyability, and content coverage. The specific case of stereoscopic 3D theatrical films has grown in importance over the past years, but has not received corresponding research attention. In this paper, a multi-stage, multimodal summarization process for such stereoscopic movies is proposed, which is able to extract a short, representative video skim conforming to narrative characteristics from a 3D film. At the initial stage, a novel, low-level video frame description method is introduced (frame moments descriptor) that compactly captures informative image statistics from luminance, color, optical flow, and stereoscopic disparity video data, at both a global and a local scale. Thus, scene texture, illumination, motion, and geometry properties may succinctly be encoded within a single frame feature descriptor, which can subsequently be employed as a building block in any key-frame extraction scheme, e.g., for intra-shot frame clustering. The computed key-frames are then used to construct a movie summary in the form of a video skim, which is post-processed in a manner that also considers the audio modality. The next stage of the proposed summarization pipeline essentially performs shot pruning, controlled by a user-provided shot retention parameter, which removes segments from the skim based on the narrative prominence of movie characters in both the visual and the audio modalities. This novel process (multimodal shot pruning) is algebraically modeled as a multimodal matrix column subset selection problem, which is solved using an evolutionary computing approach. Subsequently, disorienting editing effects induced by summarization are dealt with through manipulation of the video skim. At the last step, the skim is suitably post-processed in order to reduce stereoscopic video defects that may cause visual fatigue.
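The abstract describes the frame moments descriptor only at a high level. A minimal sketch of the general idea, assuming mean, standard deviation, and skewness as the moments and a fixed spatial grid for the local scale (the paper's exact moments and grid layout are not specified here), could look like:

```python
import numpy as np

def frame_moments_descriptor(channels, grid=(4, 4)):
    """Illustrative sketch: concatenate low-order statistical moments
    (mean, standard deviation, skewness) of each input channel,
    computed globally and on a coarse spatial grid of blocks.

    `channels` is a list of 2-D arrays of equal shape, e.g. luminance,
    a color channel, optical-flow magnitude, and disparity.
    """
    def moments(a):
        a = a.astype(np.float64).ravel()
        mu = a.mean()
        sigma = a.std()
        # Skewness; guard against flat regions with zero variance.
        skew = ((a - mu) ** 3).mean() / sigma ** 3 if sigma > 0 else 0.0
        return [mu, sigma, skew]

    feats = []
    for ch in channels:
        feats.extend(moments(ch))                 # global scale
        h, w = ch.shape
        gy, gx = grid
        for i in range(gy):                       # local scale: grid blocks
            for j in range(gx):
                block = ch[i * h // gy:(i + 1) * h // gy,
                           j * w // gx:(j + 1) * w // gx]
                feats.extend(moments(block))
    return np.array(feats)
```

With two channels and a 4x4 grid, this yields a fixed-length vector (2 channels x 3 moments x 17 regions = 102 values) that can feed any key-frame extraction scheme, such as intra-shot frame clustering.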
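The multimodal shot pruning stage is modeled as a matrix column subset selection (CSS) problem solved with evolutionary computing. A toy genetic-algorithm sketch of generic CSS (the function name, operators, and Frobenius-norm fitness below are illustrative assumptions, not the paper's actual multimodal formulation):

```python
import numpy as np

def evolutionary_css(M, k, pop_size=40, generations=100, seed=0):
    """Toy genetic algorithm for column subset selection: choose k
    columns of M whose span best reconstructs M (Frobenius norm)."""
    rng = np.random.default_rng(seed)
    n = M.shape[1]

    def fitness(cols):
        C = M[:, cols]
        # Least-squares projection of M onto the span of the chosen columns.
        proj = C @ np.linalg.lstsq(C, M, rcond=None)[0]
        return -np.linalg.norm(M - proj)  # higher (less negative) is better

    # Each chromosome is a set of k distinct column indices.
    pop = [rng.choice(n, size=k, replace=False) for _ in range(pop_size)]
    for _ in range(generations):
        elite = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(elite)):
            a, b = rng.choice(len(elite), size=2, replace=False)
            genes = np.union1d(elite[a], elite[b])      # crossover: gene pool
            child = rng.choice(genes, size=k, replace=False)
            if rng.random() < 0.2:                      # mutation: swap a column
                child[rng.integers(k)] = rng.integers(n)
                child = np.unique(child)
            while len(child) < k:                       # repair duplicates
                child = np.unique(np.append(child, rng.integers(n)))
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)
```

In the paper's setting, each column would represent a shot described in both the visual and audio modalities, and the retained subset is constrained by the user-provided shot retention parameter (here simply `k`).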