Ren Bin, Liu Mengyuan, Ding Runwei, Liu Hong
University of Pisa, Pisa, Italy.
University of Trento, Trento, Italy.
Cyborg Bionic Syst. 2024 May 16;5:0100. doi: 10.34133/cbsystems.0100. eCollection 2024.
Three-dimensional skeleton-based action recognition (3D SAR) has gained significant attention within the computer vision community, owing to the inherent advantages offered by skeleton data. As a result, a large body of work, including methods based on conventional handcrafted features and on learned feature extraction, has been produced over the years. However, prior surveys on action recognition have primarily focused on video or red-green-blue (RGB) data-dominated approaches, with limited coverage of skeleton data. Furthermore, despite the extensive application of deep learning methods in this field, there has been a notable absence of work that provides an introductory or comprehensive review from the perspective of deep learning architectures. To address these limitations, this survey first underscores the importance of action recognition and the significance of 3D skeleton data as a valuable modality. Subsequently, we provide a comprehensive introduction to mainstream action recognition techniques based on 4 fundamental deep architectures, i.e., recurrent neural networks, convolutional neural networks, graph convolutional networks, and Transformers. All methods with the corresponding architectures are then presented in a data-driven manner with detailed discussion. Finally, we offer insights into the current largest 3D skeleton dataset, NTU-RGB+D, and its new edition, NTU-RGB+D 120, along with an overview of several top-performing algorithms on these datasets. To the best of our knowledge, this survey represents the first comprehensive discussion of deep learning-based action recognition using 3D skeleton data.