Yang Huaigang, Ren Ziliang, Yuan Huaqiang, Wei Wenhong, Zhang Qieshi, Zhang Zhaolong
School of Computer Science and Technology, Dongguan University of Technology, Dongguan, China.
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China.
Front Neurorobot. 2022 Dec 15;16:1091361. doi: 10.3389/fnbot.2022.1091361. eCollection 2022.
Graph convolution networks (GCNs) have been widely used in skeleton-based human action recognition. However, improving recognition performance while reducing parameter complexity remains difficult. In this paper, a novel multi-scale attention spatiotemporal GCN (MSA-STGCN) is proposed for human violence action recognition by learning spatiotemporal features from four skeleton modality variants. First, the original joint data are preprocessed into joint position, bone vector, joint motion, and bone motion data, which serve as the inputs of the recognition framework. Then, a spatial multi-scale graph convolution network based on an attention mechanism is constructed to extract spatial features from the joint nodes, while a temporal graph convolution network using hybrid dilated convolution is designed to enlarge the receptive field of the feature maps and capture multi-scale context information. Finally, the specific relationships among the different skeleton modalities are explored by fusing the multi-stream information related to human joints and bones. To evaluate the performance of the proposed MSA-STGCN, a skeleton violence action dataset, Filtered NTU RGB+D, was constructed from NTU RGB+D 120. We conducted experiments on the constructed Filtered NTU RGB+D and the Kinetics Skeleton 400 datasets to verify the performance of the proposed recognition framework. The proposed method achieves an accuracy of 95.3% on Filtered NTU RGB+D with only 1.21M parameters, and accuracies of 36.2% (Top-1) and 58.5% (Top-5) on Kinetics Skeleton 400. The experimental results on these two skeleton datasets show that the proposed framework can effectively recognize violence actions without increasing the number of parameters.
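To make the preprocessing step concrete, the sketch below derives the four input modalities from raw joint coordinates, using the common skeleton-GCN tensor layout (C, T, V, M): coordinates, frames, joints, persons. The truncated `BONE_PAIRS` list and the zero-padded frame differences are illustrative assumptions; the paper's bone vectors follow the full NTU RGB+D joint topology.

```python
import torch

# Hypothetical (joint, parent) pairs for a simplified skeleton; the real
# bone topology follows the NTU RGB+D joint definition used in the paper.
BONE_PAIRS = [(1, 0), (2, 1), (3, 2), (4, 2), (5, 4)]

def build_modalities(joints: torch.Tensor) -> dict:
    """Derive the four skeleton modalities from joint positions of
    shape (C, T, V, M): channels, frames, joints, persons."""
    # Bone vector: offset of each joint from its parent joint.
    bones = torch.zeros_like(joints)
    for j, parent in BONE_PAIRS:
        bones[:, :, j] = joints[:, :, j] - joints[:, :, parent]

    # Motion: temporal difference between consecutive frames (last frame
    # left as zero padding).
    joint_motion = torch.zeros_like(joints)
    joint_motion[:, :-1] = joints[:, 1:] - joints[:, :-1]
    bone_motion = torch.zeros_like(bones)
    bone_motion[:, :-1] = bones[:, 1:] - bones[:, :-1]

    return {"joint": joints, "bone": bones,
            "joint_motion": joint_motion, "bone_motion": bone_motion}
```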
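The spatial branch can be read as a multi-scale graph convolution whose k-th scale aggregates k-hop neighbors via powers of a normalized adjacency matrix, modulated by an attention gate. The following is a minimal sketch under those assumptions; the exact scale count, normalization, and attention form in MSA-STGCN may differ.

```python
import torch
import torch.nn as nn

class MultiScaleSpatialGC(nn.Module):
    """Sketch of a multi-scale spatial graph convolution with channel
    attention; layer sizes and the squeeze-and-excitation-style gate
    are assumptions, not the exact MSA-STGCN design."""

    def __init__(self, in_ch, out_ch, adjacency, num_scales=3):
        super().__init__()
        # Row-normalize, then take matrix powers: the k-th power reaches
        # k-hop neighbors (power 0 is the identity, i.e., the node itself).
        A = adjacency / adjacency.sum(dim=1, keepdim=True).clamp(min=1)
        self.register_buffer(
            "A_scales",
            torch.stack([torch.matrix_power(A, k) for k in range(num_scales)]))
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=1) for _ in range(num_scales)])
        # Channel attention over the fused multi-scale features.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // 4, 1), nn.ReLU(),
            nn.Conv2d(out_ch // 4, out_ch, 1), nn.Sigmoid())

    def forward(self, x):  # x: (N, C, T, V)
        out = 0
        for conv, A_k in zip(self.convs, self.A_scales):
            # Aggregate k-hop neighbor features, then mix channels per scale.
            out = out + conv(torch.einsum("nctv,vw->nctw", x, A_k))
        return out * self.attn(out)
```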
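For the temporal branch, hybrid dilated convolution combines temporal kernels with increasing, co-prime dilation rates so the receptive field grows without the gridding gaps of a single large dilation. A minimal sketch follows, assuming parallel branches with rates (1, 2, 3); whether MSA-STGCN sums parallel branches or stacks them serially is not specified by the abstract.

```python
import torch.nn as nn

class HybridDilatedTC(nn.Module):
    """Sketch of a hybrid-dilation temporal convolution over skeleton
    feature maps of shape (N, C, T, V); kernel size and dilation rates
    are assumptions."""

    def __init__(self, channels, kernel_size=3, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList()
        for d in dilations:
            pad = (kernel_size - 1) * d // 2  # keep the temporal length fixed
            self.branches.append(nn.Conv2d(
                channels, channels, kernel_size=(kernel_size, 1),
                padding=(pad, 0), dilation=(d, 1)))

    def forward(self, x):
        # Summing the dilated branches mixes multi-scale temporal context.
        return sum(branch(x) for branch in self.branches)
```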
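Finally, the multi-stream step can be approximated as late fusion of the per-modality classification scores. The equal weights below are an assumption; learned or validation-tuned weights are equally plausible readings of the abstract.

```python
import torch
import torch.nn.functional as F

def fuse_streams(scores: list, weights=None) -> torch.Tensor:
    """Late-fuse per-stream class scores. `scores` holds one
    (N, num_classes) logit tensor per skeleton modality; equal
    stream weights are an assumption."""
    weights = weights if weights is not None else [1.0] * len(scores)
    fused = sum(w * F.softmax(s, dim=1) for w, s in zip(weights, scores))
    return fused.argmax(dim=1)  # predicted action class per sample
```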