IEEE Trans Image Process. 2017 May;26(5):2149-2162. doi: 10.1109/TIP.2017.2670782. Epub 2017 Feb 17.
Semantic information is important for video event detection, yet automatically discovering, modeling, and exploiting it remains a challenging problem. In this paper, we propose a novel hierarchical video event detection model that deliberately unifies the processes of underlying semantics discovery and event modeling from video data. Specifically, unlike most approaches based on manually pre-defined concepts, we devise an effective model that automatically uncovers video semantics by hierarchically capturing latent static-visual concepts at the frame level and latent activity concepts (i.e., temporal sequence relationships among static-visual concepts) at the segment level. The unified model not only yields a discriminative and descriptive video representation, but also alleviates the error-propagation problem from video representation to event modeling that affects previous methods. A max-margin framework is employed to learn the model. Extensive experiments on four challenging video event datasets, i.e., MED11, CCV, UQE50, and FCVID, demonstrate the effectiveness of the proposed method.
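The abstract describes learning latent concept assignments jointly with a max-margin event classifier. As a hedged illustration only, the sketch below shows the general shape of latent max-margin training (latent-SVM-style): each video offers several candidate latent concept assignments, the best-scoring one is inferred during training, and a hinge-loss subgradient updates the weights. All feature maps, data, and dimensions here are toy assumptions, not the paper's actual model or features.

```python
# Illustrative latent max-margin training sketch (NOT the paper's model).
# Each "video" has H candidate latent-assignment feature vectors; the model
# scores a video by the best assignment: score(x) = max_h w . phi(x, h).
import random

random.seed(0)

D = 4   # assumed feature dimension per latent assignment
H = 3   # assumed number of candidate latent assignments per video

def dot(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))

def score(w, feats):
    # Latent inference: take the best-scoring assignment.
    return max(dot(w, f) for f in feats)

def best_assignment(w, feats):
    return max(feats, key=lambda f: dot(w, f))

def train(data, epochs=50, lr=0.1, lam=0.01):
    # data: list of (feats, y) with y in {-1, +1};
    # feats is a list of H candidate feature vectors.
    w = [0.0] * D
    for _ in range(epochs):
        for feats, y in data:
            phi = best_assignment(w, feats)   # infer latent variable
            margin = y * dot(w, phi)
            # Subgradient of hinge loss max(0, 1 - margin) + (lam/2)||w||^2.
            for i in range(D):
                g = lam * w[i] - (y * phi[i] if margin < 1 else 0.0)
                w[i] -= lr * g
    return w

# Toy separable data: positives put mass on dim 0, negatives on dim 2.
def make_video(positive):
    feats = []
    for _ in range(H):
        f = [random.random() * 0.1 for _ in range(D)]
        f[0 if positive else 2] += 1.0
        feats.append(f)
    return feats

data = ([(make_video(True), +1) for _ in range(20)]
        + [(make_video(False), -1) for _ in range(20)])

w = train(data)
correct = sum(1 for feats, y in data
              if (1 if score(w, feats) > 0 else -1) == y)
print(correct, "/", len(data), "training examples classified correctly")
```

On this toy separable data the learned weights separate the two classes; the point is only to show how latent inference (the inner `max` over assignments) sits inside the max-margin outer loop, which is the kind of unified discovery-plus-modeling objective the abstract refers to.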