Hu Huangshui, Cao Yu, Fang Yue, Meng Zhiqiang
College of Computer Science and Engineering, Changchun University of Technology, Changchun 130012, China.
Sensors (Basel). 2025 Mar 15;25(6):1841. doi: 10.3390/s25061841.
Skeleton-based action recognition networks often focus on extracting features such as joints from samples while neglecting the semantic relationships inherent in actions, which also carry valuable information. To address this underuse of semantic information, this paper proposes a semantics-assisted training graph convolution network (SAT-GCN). The features output by the skeleton encoder are divided into four parts and contrasted with the text features generated by a text encoder, and the resulting contrastive loss guides the training of the whole network. This approach effectively improves recognition accuracy while reducing the number of model parameters. In addition, angle features are incorporated into the skeleton model's input to help classify similar actions. Finally, a multi-feature skeleton encoder is designed to extract joint, bone, and angle features separately; these features are then integrated through feature fusion. The fused features pass through three graph convolution blocks before being fed into fully connected layers for classification. Extensive experiments were conducted on three large-scale datasets, NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA, to validate the performance of the proposed model. The results show that SAT-GCN outperforms existing methods in both accuracy and number of parameters.
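The semantics-assisted training described above contrasts skeleton-encoder features with text-encoder features and uses the contrastive loss to guide training. The abstract does not give the exact loss formulation, so the following is a minimal sketch assuming a standard CLIP-style bidirectional InfoNCE loss over one batch of aligned skeleton/text embedding pairs (the split into four feature parts, the encoders themselves, and the temperature value are not specified in the abstract and are assumptions here):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize feature vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(skel_feats, text_feats, temperature=0.07):
    """CLIP-style InfoNCE loss between skeleton and text embeddings.

    skel_feats: (N, D) skeleton features for a batch of N samples.
    text_feats: (N, D) text features of the matching class prompts,
                row-aligned with skel_feats (pair i matches pair i).
    """
    s = l2_normalize(skel_feats)
    t = l2_normalize(text_feats)
    logits = s @ t.T / temperature          # (N, N) cosine-similarity matrix
    labels = np.arange(logits.shape[0])     # matching pairs lie on the diagonal

    def xent(lg):
        # Softmax cross-entropy with the diagonal as the target class.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the skeleton-to-text and text-to-skeleton directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

In such a scheme, well-aligned skeleton and text embeddings drive the loss toward zero, while mismatched pairs are penalized, which is what lets the text features steer the skeleton encoder during training.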