Centre for Vision, Speech, and Signal Processing, University of Surrey, Guildford, GU2 7XH, UK.
IEEE Trans Pattern Anal Mach Intell. 2011 May;33(5):883-97. doi: 10.1109/TPAMI.2010.144.
The field of action recognition has seen a large increase in activity in recent years. Much of the progress has come from taking ideas from single-frame object recognition and adapting them for temporal-based action recognition. Inspired by the success of interest points in the 2D spatial domain, their 3D (space-time) counterparts typically form the basic components used to describe actions. In action recognition, these features are often engineered to fire sparsely so that the problem remains tractable. However, this can sacrifice recognition accuracy, as it cannot be assumed that the optimum features in terms of class discrimination are obtained from this approach. In contrast, we propose to start from an overcomplete set of simple 2D corners in both space and time. These are grouped spatially and temporally using a hierarchical process with an increasing search area. At each stage of the hierarchy, the most distinctive and descriptive features are learned efficiently through data mining, which allows large amounts of data to be searched for frequently reoccurring patterns of features. At each level of the hierarchy, the mined compound features become more complex, discriminative, and sparse. This results in fast, accurate recognition with real-time performance on high-resolution video. As the compound features are constructed and selected based upon their ability to discriminate, their speed and accuracy increase at each level of the hierarchy. The approach is tested on four state-of-the-art data sets: the popular KTH data set, to provide a comparison with other state-of-the-art approaches; the Multi-KTH data set, to illustrate performance at simultaneous multiaction classification despite no explicit localization information being provided during training; and the recent Hollywood and Hollywood2 data sets, which provide challenging complex actions taken from commercial movie sequences. For all four data sets, the proposed hierarchical approach outperforms all other methods reported thus far in the literature and can achieve real-time operation.
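To make the mining step concrete, the following is a minimal sketch of how frequently reoccurring feature combinations might be selected for one action class. The abstract does not specify the mining algorithm, so this uses plain brute-force frequent itemset counting. The function name `mine_discriminative_patterns`, the symbolic corner codes, and the support and confidence thresholds are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def mine_discriminative_patterns(transactions, labels, target,
                                 min_support=0.5, min_confidence=0.8,
                                 max_size=3):
    """Return {pattern: (support, confidence)} for the target class.

    transactions: list of lists of feature codes (assumed encoding, one
                  list per local spatio-temporal neighbourhood).
    labels:       action label for each transaction.
    support:      fraction of target-class transactions containing the pattern.
    confidence:   fraction of all transactions containing the pattern that
                  belong to the target class.
    """
    n_target = sum(1 for lab in labels if lab == target)
    if n_target == 0:
        return {}

    in_target = Counter()  # pattern counts within the target class
    overall = Counter()    # pattern counts across all classes

    for items, lab in zip(transactions, labels):
        unique = sorted(set(items))
        # Brute-force enumeration of small combinations (no Apriori pruning).
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                overall[combo] += 1
                if lab == target:
                    in_target[combo] += 1

    patterns = {}
    for combo, count in in_target.items():
        support = count / n_target
        confidence = count / overall[combo]
        if support >= min_support and confidence >= min_confidence:
            patterns[combo] = (support, confidence)
    return patterns


if __name__ == "__main__":
    # Toy data: each transaction is a set of hypothetical corner codes.
    transactions = [["c1", "c2", "c3"], ["c1", "c2"], ["c1", "c4"], ["c3", "c4"]]
    labels = ["wave", "wave", "walk", "walk"]
    print(mine_discriminative_patterns(transactions, labels, "wave",
                                       min_support=1.0, min_confidence=1.0,
                                       max_size=2))
```

In a hierarchical scheme like the one described, the patterns kept at one level could be treated as the feature codes for the next level, grouped over a larger neighbourhood. That chaining is an interpretation of the abstract rather than a detail it states.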