Liu Mengyuan, Meng Fanyang, Liang Yongsheng
Key Laboratory of Machine Perception, Peking University, Shenzhen Graduate School, Shenzhen, China.
Peng Cheng Laboratory, Shenzhen, China.
Cyborg Bionic Syst. 2022;2022:0002. doi: 10.34133/cbsystems.0002. Epub 2022 Dec 30.
Human action representation is derived from the description of human shape and motion. The traditional unsupervised 3-dimensional (3D) human action representation learning method uses a recurrent neural network (RNN)-based autoencoder to reconstruct the input pose sequence and then takes the midlevel feature of the autoencoder as the representation. Although an RNN can implicitly learn a certain amount of motion information, the extracted representation mainly describes human shape and is insufficient to describe motion. Therefore, we first present a handcrafted motion feature called pose flow to guide the reconstruction of the autoencoder, whose midlevel feature is then expected to describe motion information. However, the performance of this approach is limited, as we observe that actions can be distinctive in either motion direction or motion norm. For example, we can distinguish "sitting down" from "standing up" by motion direction, but "running" from "jogging" only by motion norm. In such cases, it is difficult to learn distinctive features from pose flow, where direction and norm are mixed. To this end, we present an explicit pose decoupled flow network (PDF-E) that learns from direction and norm in a multi-task learning framework, where 1 encoder is used to generate the representation and 2 decoders are used to generate direction and norm, respectively. Further, we use reconstruction of the input pose sequence as an additional constraint and present a generalized PDF network (PDF-G) to learn both motion and shape information. PDF-G achieves state-of-the-art performance on large-scale and challenging 3D action recognition datasets, including the NTU RGB+D 60 and NTU RGB+D 120 datasets.
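The decoupling described above can be illustrated with a short sketch: pose flow is the frame-to-frame displacement of each joint, and each displacement vector is split into a unit direction vector and a scalar norm. This is a minimal numpy illustration under assumed tensor shapes (the joint count matches NTU RGB+D skeletons, but variable names and sizes are not taken from the paper's implementation):

```python
import numpy as np

# Hypothetical pose sequence: T frames, J joints, 3D coordinates.
T, J = 20, 25                      # 25 joints as in NTU RGB+D skeletons
rng = np.random.default_rng(0)
poses = rng.standard_normal((T, J, 3))

# Pose flow: frame-to-frame displacement of every joint.
flow = poses[1:] - poses[:-1]      # shape (T-1, J, 3)

# Decouple each flow vector into its norm (speed) and direction (unit vector).
eps = 1e-8                         # avoid division by zero for static joints
norm = np.linalg.norm(flow, axis=-1, keepdims=True)   # (T-1, J, 1)
direction = flow / (norm + eps)                       # (T-1, J, 3)

# Sanity check: direction * norm reconstructs the original pose flow.
assert np.allclose(direction * norm, flow, atol=1e-5)
```

Supervising the two components separately gives the encoder a chance to capture direction-distinctive actions (e.g., "sitting down" vs. "standing up") and norm-distinctive actions (e.g., "running" vs. "jogging") independently.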
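The multi-task structure of PDF-E (one shared encoder, two decoder heads for direction and norm) can be sketched as follows. The paper uses an RNN-based autoencoder; this stand-in uses plain linear maps in numpy purely to show the shared-representation / dual-head wiring and the summed reconstruction objective, with all layer sizes hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions (illustrative, not from the paper): a flattened pose-flow
# frame of 25 joints x 3 coords in, a 32-d midlevel representation.
D_in, D_hid = 25 * 3, 32

# One shared encoder and two decoder heads, as in the PDF-E multi-task
# framework; real PDF-E uses RNN encoders/decoders, these are linear stand-ins.
W_enc = rng.standard_normal((D_in, D_hid)) * 0.01
W_dir = rng.standard_normal((D_hid, D_in)) * 0.01        # direction head: 3D per joint
W_nrm = rng.standard_normal((D_hid, D_in // 3)) * 0.01   # norm head: 1 scalar per joint

def forward(x):
    """Encode x into the shared midlevel feature, then decode both heads."""
    z = np.tanh(x @ W_enc)          # shared representation used downstream
    return z, z @ W_dir, z @ W_nrm

x = rng.standard_normal((4, D_in))             # batch of 4 flow frames
target_dir = rng.standard_normal((4, D_in))
target_nrm = rng.standard_normal((4, D_in // 3))

z, pred_dir, pred_nrm = forward(x)
# Multi-task objective: sum of the two reconstruction losses.
loss = np.mean((pred_dir - target_dir) ** 2) + np.mean((pred_nrm - target_nrm) ** 2)
```

PDF-G would extend this with a third decoder head that reconstructs the input pose sequence itself, so the shared representation `z` is constrained to carry shape information as well as motion.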