Nanyang Technological University, N4-02a-29, Nanyang Avenue, Singapore 639798.
IEEE Trans Pattern Anal Mach Intell. 2012 Sep;34(9):1667-80. doi: 10.1109/TPAMI.2011.265.
We propose a visual event recognition framework for consumer videos that leverages a large amount of loosely labeled web videos (e.g., from YouTube). Observing that consumer videos generally contain large intraclass variations within the same type of events, we first propose a new method, called Aligned Space-Time Pyramid Matching (ASTPM), to measure the distance between any two video clips. Second, we propose a new transfer learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), in order to 1) fuse the information from multiple pyramid levels and feature types (i.e., space-time features and static SIFT features) and 2) cope with the considerable variation in feature distributions between videos from the two domains (i.e., the web video domain and the consumer video domain). For each pyramid level and each type of local feature, we first train a set of SVM classifiers on the combined training set from the two domains using multiple base kernels of different kernel types and parameters; these classifiers are then fused with equal weights to obtain a prelearned average classifier. In A-MKL, for each event class we learn an adapted target classifier based on multiple base kernels and the prelearned average classifiers from that event class or from all event classes by minimizing both the structural risk functional and the mismatch between the data distributions of the two domains. Extensive experiments demonstrate the effectiveness of our proposed framework, which, by leveraging web data, requires only a small number of labeled consumer videos. We also conduct an in-depth investigation of various aspects of the proposed A-MKL method, such as an analysis of the combination coefficients of the prelearned classifiers, the convergence of the learning algorithm, and the performance variation when using different proportions of labeled consumer videos. Moreover, we show that A-MKL using the prelearned classifiers from all event classes outperforms A-MKL using the prelearned classifiers from each individual event class only.
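For orientation, the adapted target classifier described in the abstract can be written in the following general form. This is a hedged sketch of the objective's overall shape, assuming Maximum Mean Discrepancy (MMD) as the distribution-mismatch measure; the symbols \beta_p, d_m, \mathbf{w}_m, b, \theta, and \phi_m follow common multiple kernel learning notation and are introduced here for illustration rather than quoted from the paper.

f^T(x) = \sum_p \beta_p \bar{f}_p(x) + \sum_m d_m \mathbf{w}_m^\top \phi_m(x) + b,

\min_{\mathbf{d},\, \boldsymbol{\beta},\, \{\mathbf{w}_m\},\, b} \ \Omega(\mathbf{d}) + \theta\, J(\mathbf{d}, \boldsymbol{\beta}, \{\mathbf{w}_m\}, b), \qquad \Omega(\mathbf{d}) = \Big\| \tfrac{1}{n_A} \sum_{i \in \mathcal{A}} \phi_{\mathbf{d}}(x_i) - \tfrac{1}{n_T} \sum_{i \in \mathcal{T}} \phi_{\mathbf{d}}(x_i) \Big\|^2,

where \bar{f}_p are the prelearned average classifiers, J is a hinge-loss structural risk functional over the labeled training data, and \mathcal{A}, \mathcal{T} index the auxiliary (web) and target (consumer) domain samples.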
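As a concrete illustration of the prelearned average classifier step, the following minimal Python sketch (using scikit-learn) trains one SVM per base kernel on a combined two-domain training set and fuses the decision values with equal weights. The synthetic data, kernel parameters, and helper names are hypothetical stand-ins, not the paper's experimental setup; real inputs would be space-time or SIFT feature representations at one pyramid level.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-ins for the combined training set from the web (auxiliary)
# and consumer (target) domains, with binary event labels in {-1, +1}.
X_web = rng.normal(0.0, 1.0, size=(60, 20))
y_web = rng.integers(0, 2, size=60) * 2 - 1
X_con = rng.normal(0.3, 1.2, size=(20, 20))
y_con = rng.integers(0, 2, size=20) * 2 - 1
X = np.vstack([X_web, X_con])
y = np.concatenate([y_web, y_con])

def rbf(A, B, gamma):
    """Gaussian base kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Multiple base kernels from different kernel parameters
# (illustrative values only).
gammas = [0.01, 0.05, 0.1, 0.5]

# One SVM per base kernel, trained on the combined two-domain set.
svms = []
for g in gammas:
    clf = SVC(kernel="precomputed", C=1.0)
    clf.fit(rbf(X, X, g), y)
    svms.append(clf)

def average_classifier(X_test):
    """Prelearned average classifier: equal-weight fusion of the
    per-kernel SVM decision values."""
    scores = [clf.decision_function(rbf(X_test, X, g))
              for clf, g in zip(svms, gammas)]
    return np.mean(scores, axis=0)

print(average_classifier(X_con[:5]))

Each fused score plays the role of one prelearned average classifier \bar{f}_p(x); A-MKL would then learn the combination coefficients \beta_p and the kernel-based perturbation term on top of these fixed scores.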