Zhang Pengfei, Lan Cuiling, Xing Junliang, Zeng Wenjun, Xue Jianru, Zheng Nanning
IEEE Trans Pattern Anal Mach Intell. 2019 Aug;41(8):1963-1978. doi: 10.1109/TPAMI.2019.2896631. Epub 2019 Jan 31.
Skeleton-based human action recognition has recently attracted increasing attention owing to the accessibility and popularity of 3D skeleton data. One of the key challenges in action recognition lies in the large variation in action representations when actions are captured from different viewpoints. To alleviate the effects of view variation, this paper introduces a novel view adaptation scheme that automatically determines the virtual observation viewpoints over the course of an action in a learning-based, data-driven manner. Instead of re-positioning the skeletons according to a fixed, human-defined prior criterion, we design two view-adaptive neural networks, VA-RNN and VA-CNN, built on a Long Short-Term Memory (LSTM) recurrent neural network (RNN) and a convolutional neural network (CNN), respectively. For each network, a novel view adaptation module learns and determines the most suitable observation viewpoints and transforms the skeletons to those viewpoints, so that recognition is performed end-to-end with a main classification network. Ablation studies show that the proposed view-adaptive models transform skeletons captured from various views to considerably more consistent virtual viewpoints. The models therefore largely eliminate the influence of viewpoint, allowing the networks to focus on learning action-specific features and resulting in superior performance. In addition, we design a two-stream scheme (referred to as VA-fusion) that fuses the scores of the two networks to provide the final prediction, obtaining further improved performance. Moreover, random rotation of skeleton sequences is employed to improve the robustness of the view adaptation models and to alleviate overfitting during training. Extensive experimental evaluations on five challenging benchmarks demonstrate the effectiveness of the proposed view-adaptive networks and their superior performance over state-of-the-art approaches.
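To make the described view adaptation idea concrete, the following is a minimal PyTorch sketch of a VA-RNN-style model: a subnetwork regresses per-frame rotation angles and a translation from the input skeleton, the skeleton is re-observed under that learned virtual viewpoint, and a main LSTM classifier operates on the transformed sequence. The module names, layer sizes, and the Euler-angle parameterization of the rotation are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class ViewAdaptiveLSTM(nn.Module):
    """Sketch of a view-adaptation subnetwork plus main LSTM classifier
    (hypothetical names/sizes; illustrates the general scheme only)."""

    def __init__(self, num_joints=25, hidden=100, num_classes=60):
        super().__init__()
        in_dim = num_joints * 3
        # Subnetwork regressing 3 rotation angles and a 3-D translation per frame.
        self.view_lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.rot_fc = nn.Linear(hidden, 3)
        self.trans_fc = nn.Linear(hidden, 3)
        # Main classification network operating on the re-observed skeleton.
        self.main_lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, num_classes)

    @staticmethod
    def rotation_matrix(angles):
        # angles: (B, T, 3) Euler angles around the x, y, z axes.
        a, b, c = angles.unbind(-1)
        zeros, ones = torch.zeros_like(a), torch.ones_like(a)
        Rx = torch.stack([ones, zeros, zeros,
                          zeros, a.cos(), -a.sin(),
                          zeros, a.sin(), a.cos()], dim=-1).reshape(*a.shape, 3, 3)
        Ry = torch.stack([b.cos(), zeros, b.sin(),
                          zeros, ones, zeros,
                          -b.sin(), zeros, b.cos()], dim=-1).reshape(*b.shape, 3, 3)
        Rz = torch.stack([c.cos(), -c.sin(), zeros,
                          c.sin(), c.cos(), zeros,
                          zeros, zeros, ones], dim=-1).reshape(*c.shape, 3, 3)
        return Rz @ Ry @ Rx

    def forward(self, x):
        # x: (B, T, J, 3) skeleton joint coordinates over T frames.
        B, T, J, _ = x.shape
        flat = x.reshape(B, T, J * 3)
        h, _ = self.view_lstm(flat)
        R = self.rotation_matrix(self.rot_fc(h))   # (B, T, 3, 3) learned rotations
        d = self.trans_fc(h).unsqueeze(2)          # (B, T, 1, 3) learned translations
        # Re-observe every joint under the learned virtual viewpoint.
        x_adapted = torch.einsum('btij,btkj->btki', R, x - d)
        h2, _ = self.main_lstm(x_adapted.reshape(B, T, J * 3))
        return self.cls(h2[:, -1])                 # per-sequence class scores
```

Because the transform is differentiable, the viewpoint regression is trained jointly with the classifier using only the action labels; the abstract's VA-CNN variant applies an analogous learned transform before a CNN that consumes the skeleton sequence as an image-like map, and VA-fusion averages the two networks' class scores.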