Yang Hao, Yan Dan, Zhang Li, Sun Yunda, Li Dong, Maybank Stephen J
IEEE Trans Image Process. 2022;31:164-175. doi: 10.1109/TIP.2021.3129117. Epub 2021 Dec 2.
Skeleton-based action recognition has attracted considerable attention since skeleton data are more robust to dynamic circumstances and cluttered backgrounds than other modalities. Recently, many researchers have used the Graph Convolutional Network (GCN) to model spatial-temporal features of skeleton sequences through end-to-end optimization. However, conventional GCNs are feedforward networks in which shallower layers cannot access the semantic information in higher-level layers. In this paper, we propose a novel network, named the Feedback Graph Convolutional Network (FGCN). This is the first work to introduce a feedback mechanism into GCNs for action recognition. Compared with conventional GCNs, FGCN has the following advantages: (1) a multi-stage temporal sampling strategy is designed to extract spatial-temporal features for action recognition in a coarse-to-fine process; (2) a Feedback Graph Convolutional Block (FGCB) is proposed to introduce dense feedback connections into GCNs. It transmits high-level semantic features to the shallower layers and conveys temporal information stage by stage to model video-level spatial-temporal features for action recognition; (3) the FGCN model provides predictions on the fly. In the early stages, its predictions are relatively coarse. These coarse predictions are treated as priors to guide the feature learning in later stages, yielding more accurate predictions. Extensive experiments on three datasets, NTU-RGB+D, NTU-RGB+D120 and Northwestern-UCLA, demonstrate that the proposed FGCN is effective for action recognition. It achieves state-of-the-art performance on all three datasets.
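The abstract does not give implementation details, but the core idea (a multi-stage loop in which high-level features from one temporal stage are fed back into the shallow layer of the next, with a coarse-to-fine prediction emitted at every stage) can be sketched minimally. The graph sizes, layer widths, the `graph_conv` helper and the fused-adjacency choice below are all illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def graph_conv(x, adj, w):
    """Toy graph convolution (assumed form): aggregate over the normalized
    adjacency, project with a weight matrix, apply a nonlinearity.
    x: (joints, in_dim), adj: (joints, joints), w: (in_dim, out_dim)."""
    return np.tanh(adj @ x @ w)

# Hypothetical skeleton graph: 5 joints with a simple normalized adjacency.
n_joints, in_dim, hid, n_classes, n_stages = 5, 8, 16, 4, 3
adj = 0.5 * np.eye(n_joints) + np.full((n_joints, n_joints), 0.5 / n_joints)

w_in   = 0.1 * rng.standard_normal((in_dim + hid, hid))  # shallow layer sees input + feedback
w_deep = 0.1 * rng.standard_normal((hid, hid))           # deeper layer
w_out  = 0.1 * rng.standard_normal((hid, n_classes))     # per-stage classifier

def fgcn_forward(clips):
    """Run one sampled clip per temporal stage. The deep features of stage t
    are fed back into the shallow layer at stage t+1 (the feedback connection),
    and each stage emits its own class scores (coarse early, refined later)."""
    feedback = np.zeros((n_joints, hid))
    stage_preds = []
    for x in clips:
        shallow = graph_conv(np.concatenate([x, feedback], axis=1), adj, w_in)
        deep = graph_conv(shallow, adj, w_deep)
        feedback = deep                          # high-level features flow back
        logits = (deep @ w_out).mean(axis=0)     # pool over joints -> class scores
        stage_preds.append(logits)
    return stage_preds

clips = [rng.standard_normal((n_joints, in_dim)) for _ in range(n_stages)]
stage_preds = fgcn_forward(clips)
```

In this sketch, later stages condition on earlier high-level features, which is the mechanism the abstract credits for refining coarse early predictions; the real FGCB additionally uses dense feedback connections between multiple layers rather than the single loop shown here.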