IEEE Trans Pattern Anal Mach Intell. 2017 May;39(5):1028-1039. doi: 10.1109/TPAMI.2016.2565479. Epub 2016 May 10.
The advent of cost-effective, easy-to-operate depth cameras has facilitated a variety of visual recognition tasks, including human activity recognition. This paper presents a novel framework for recognizing human activities from video sequences captured by depth cameras. We extend the surface normal to the polynormal by assembling local neighboring hypersurface normals from a depth sequence to jointly characterize local motion and shape information. We then propose a general scheme, the super normal vector (SNV), to aggregate the low-level polynormals into a discriminative representation, which can be viewed as a simplified version of the Fisher kernel representation. To capture the spatial layout and temporal order globally, an adaptive spatio-temporal pyramid is introduced to subdivide a depth video into a set of space-time cells. In extensive experiments, the proposed approach outperforms state-of-the-art methods on four public benchmark datasets: MSRAction3D, MSRDailyActivity3D, MSRGesture3D, and MSRActionPairs3D.
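The polynormal construction mentioned in the abstract can be illustrated with a minimal sketch. It assumes the depth sequence is treated as a hypersurface z = f(x, y, t), so that a normal at each point is proportional to (-∂z/∂x, -∂z/∂y, -∂z/∂t, 1), and that a polynormal is simply the concatenation of the normals in a small space-time neighborhood. The function names, the finite-difference gradients, and the cubic neighborhood with `radius=1` are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import numpy as np


def hypersurface_normals(depth):
    """Unit hypersurface normals of a depth video.

    depth: array of shape (T, H, W), one depth map per frame.
    Returns an array of shape (T, H, W, 4).
    """
    depth = np.asarray(depth, dtype=float)
    # Finite-difference derivatives along time, rows (y), and columns (x).
    dz_dt, dz_dy, dz_dx = np.gradient(depth)
    normals = np.stack(
        [-dz_dx, -dz_dy, -dz_dt, np.ones_like(depth)], axis=-1
    )
    # The last component is always 1, so the norm is never zero.
    return normals / np.linalg.norm(normals, axis=-1, keepdims=True)


def polynormal(normals, t, y, x, radius=1):
    """Concatenate the normals in a cubic space-time neighborhood.

    The neighborhood shape is an illustrative assumption. The point
    (t, y, x) must lie at least `radius` away from every border.
    """
    r = radius
    patch = normals[t - r:t + r + 1, y - r:y + r + 1, x - r:x + r + 1]
    return patch.reshape(-1)
```

With `radius=1`, each polynormal stacks 27 four-dimensional normals into a 108-dimensional descriptor, so it encodes both the local shape (spatial derivatives) and the local motion (temporal derivative) around a point.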