Xu Dong, Chang Shih-Fu
School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Blk N4, Singapore.
IEEE Trans Pattern Anal Mach Intell. 2008 Nov;30(11):1985-97. doi: 10.1109/TPAMI.2008.129.
In this work, we systematically study the problem of event recognition in unconstrained news video sequences. We adopt a discriminative kernel-based method, for which video clip similarity plays an important role. First, we represent a video clip as a bag of orderless descriptors extracted from all of its constituent frames and apply the earth mover's distance (EMD) to integrate similarities among frames from two clips. Observing that a video clip usually comprises multiple subclips corresponding to the evolution of an event over time, we further build a multilevel temporal pyramid. At each pyramid level, we integrate the information from different subclips with integer-value-constrained EMD to explicitly align the subclips. By fusing the information from the different pyramid levels, we develop Temporally Aligned Pyramid Matching (TAPM) for measuring video similarity. We conduct comprehensive experiments on the TRECVID 2005 corpus, which contains more than 6,800 clips. Our experiments demonstrate that 1) the multilevel TAPM method clearly outperforms single-level EMD (SLEMD) and 2) SLEMD outperforms keyframe-based and multiframe-based detection methods by a large margin. In addition, we conduct an in-depth investigation of various aspects of the proposed techniques, such as weight selection in SLEMD, sensitivity to temporal clustering, the effect of temporal alignment, and possible approaches to speedup. Extensive analysis of the results also reveals an intuitive interpretation of video event recognition through video subclip alignment at different levels.
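The matching scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: frame descriptors are rows of a NumPy array, every frame and every subclip gets uniform weight, the ground distance is Euclidean, subclips are formed by cutting a clip into equal contiguous segments (the abstract mentions temporal clustering instead), and the per-level distances are fused by a simple weighted sum with hypothetical weights. Under uniform weights and equal subclip counts, the integer-constrained alignment is treated here as a one-to-one assignment problem. The function names `emd`, `aligned_distance`, and `tapm_distance` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog, linear_sum_assignment


def emd(X, Y):
    """Earth mover's distance between two bags of descriptors.

    X, Y: float arrays of shape (num_frames, dim). Uniform frame
    weights are an assumption of this sketch.
    """
    m, n = len(X), len(Y)
    # Ground distance between every pair of frame descriptors.
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    w = np.full(m, 1.0 / m)
    v = np.full(n, 1.0 / n)
    # Flow variables f_ij are flattened row-major into a vector of length m*n.
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0  # sum_j f_ij = w_i
    for j in range(n):
        A_eq[m + j, j::n] = 1.0           # sum_i f_ij = v_j
    b_eq = np.concatenate([w, v])
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    # Total flow is 1, so the optimal cost is the EMD itself.
    return res.fun


def aligned_distance(X, Y, parts):
    """Distance at one pyramid level with explicit subclip alignment.

    Each clip is cut into `parts` contiguous segments (a simplification),
    and segments are matched one-to-one at minimum total EMD.
    """
    xs = np.array_split(X, parts)
    ys = np.array_split(Y, parts)
    C = np.array([[emd(a, b) for b in ys] for a in xs])
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].mean(), cols


def tapm_distance(X, Y, level_weights=(1.0, 1.0, 1.0)):
    """Weighted sum of distances over pyramid levels 0, 1, 2, ...

    Level l uses 2**l subclips; the weights are hypothetical.
    """
    total = 0.0
    for level, h in enumerate(level_weights):
        if level == 0:
            d = emd(X, Y)  # single-level EMD over whole clips
        else:
            d, _ = aligned_distance(X, Y, 2 ** level)
        total += h * d
    return total / sum(level_weights)
```

For example, if clip `Y` contains the same frames as clip `X` with its two halves swapped, `aligned_distance(X, Y, 2)` matches the first half of `X` to the second half of `Y`, whereas a fixed in-order comparison of the halves would report a large distance. Each clip needs at least as many frames as the number of subclips at the finest level used.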