Zhang Jingtian, Shum Hubert P H, Han Jungong, Shao Ling
IEEE Trans Image Process. 2018 May 15. doi: 10.1109/TIP.2018.2836323.
Human action recognition is crucial to many practical applications, ranging from human-computer interaction to video surveillance. Most approaches either recognize the human action from a fixed view or require knowledge of the viewing angle, which is usually not available in practical applications. In this paper, we propose a novel end-to-end framework to jointly learn a view-invariance transfer dictionary and a view-invariant classifier. The result of this process is a dictionary that can project real-world 2D video into a view-invariant sparse representation, together with a classifier that recognizes actions from an arbitrary view. The main feature of our algorithm is the use of synthetic data to extract view-invariance between 3D and 2D videos during the pre-training phase. This guarantees the availability of training data and removes the difficulty of obtaining real-world videos at specific viewing angles. Additionally, to better describe the actions in 3D videos, we introduce a new feature set called 3D dense trajectories, which effectively encodes trajectory information extracted from 3D videos. Experimental results on the IXMAS, N-UCLA, i3DPost and UWA3DII datasets show improvements over existing algorithms.
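The core mechanism the abstract describes, coupling 2D and 3D features through a shared sparse code so that paired samples map to the same view-invariant representation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, dimensions, ISTA sparse-coding step, and least-squares dictionary update below are all assumptions chosen to show the general transferable-dictionary-learning pattern.

```python
import numpy as np

def ista(D, X, lam, n_iter=100):
    """Sparse coding: min_A 0.5*||X - D@A||_F^2 + lam*||A||_1 via ISTA."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the quadratic gradient
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        A = A - (D.T @ (D @ A - X)) / L                          # gradient step
        A = np.sign(A) * np.maximum(np.abs(A) - lam / L, 0.0)    # soft-threshold
    return A

def learn_shared_dictionary(X2d, X3d, n_atoms=64, lam=0.1, n_outer=20):
    """Learn per-view dictionaries D2d, D3d constrained to share one set of
    sparse codes A, so paired 2D/3D samples get the same representation.
    (Hypothetical sketch; not the paper's exact objective.)"""
    rng = np.random.default_rng(0)
    D2d = rng.standard_normal((X2d.shape[0], n_atoms))
    D3d = rng.standard_normal((X3d.shape[0], n_atoms))
    for _ in range(n_outer):
        # Shared codes: sparse-code the stacked views with the stacked dictionary.
        A = ista(np.vstack([D2d, D3d]), np.vstack([X2d, X3d]), lam)
        # Dictionary update: per-view least squares, then column normalization
        # (standard in dictionary learning to remove scale ambiguity).
        D2d = X2d @ np.linalg.pinv(A)
        D3d = X3d @ np.linalg.pinv(A)
        D2d /= np.maximum(np.linalg.norm(D2d, axis=0), 1e-8)
        D3d /= np.maximum(np.linalg.norm(D3d, axis=0), 1e-8)
    return D2d, D3d

# Toy usage: 500 paired samples of synthetic 2D (200-dim) and 3D (300-dim) features.
X2d = np.random.randn(200, 500)
X3d = np.random.randn(300, 500)
D2d, D3d = learn_shared_dictionary(X2d, X3d)
codes = ista(D2d, X2d, lam=0.1)  # sparse representation of 2D input alone
```

At test time only the 2D dictionary is needed: a real-world 2D video feature is sparse-coded against D2d, and because the codes were tied across views during training, the resulting representation is approximately view-invariant and can be fed to a standard classifier.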