Bian Cunling, Feng Wei, Wan Liang, Wang Song
IEEE Trans Image Process. 2021;30:2963-2976. doi: 10.1109/TIP.2021.3056895. Epub 2021 Feb 17.
Skeleton data have been extensively used for action recognition since they can robustly accommodate dynamic circumstances and complex backgrounds. To guarantee action-recognition performance, advanced but time-consuming algorithms are usually preferred for obtaining more accurate and complete skeletons from the scene. However, this may not be acceptable in time- and resource-constrained applications. In this paper, we explore the feasibility of using low-quality skeletons, which can be quickly and easily estimated from the scene, for action recognition. Because low-quality skeletons inevitably degrade action-recognition accuracy, we propose a structural knowledge distillation scheme to minimize this accuracy degradation and improve the recognition model's robustness to uncontrollable skeleton corruptions. More specifically, a teacher that observes high-quality skeletons obtained from a scene is used to help train a student that sees only low-quality skeletons generated from the same scene. At inference time, only the student network is deployed to process low-quality skeletons. Within the proposed network, a graph matching loss distills graph structural knowledge at an intermediate representation level. We also propose a new gradient revision strategy to balance mimicking the teacher model against directly improving the student model's accuracy. Experiments on the Kinetics-400, NTU RGB+D and Penn Action datasets, together with comparison results, demonstrate the effectiveness of our scheme.
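The abstract names two components, a graph matching loss and a gradient revision strategy, without giving their formulas. The sketch below is only one plausible reading, not the authors' definitions. It assumes the graph structure is summarized by a cosine-affinity matrix over per-joint features, that the matching loss is a mean squared difference between the student and teacher affinities, and that gradient revision removes the component of the distillation gradient that opposes the task gradient. The function names `affinity`, `graph_matching_loss` and `revise_gradient` are hypothetical.

```python
import numpy as np


def affinity(features):
    """Cosine-similarity matrix between per-joint feature vectors (J x D -> J x J)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard against zero-length rows
    return unit @ unit.T


def graph_matching_loss(student_feats, teacher_feats):
    """Assumed loss: mean squared difference between the two affinity matrices."""
    diff = affinity(student_feats) - affinity(teacher_feats)
    return float(np.mean(diff ** 2))


def revise_gradient(g_task, g_distill):
    """Assumed rule: add the gradients, first dropping any part of g_distill
    that points against g_task (a projection onto the non-conflicting side)."""
    dot = float(g_task @ g_distill)
    if dot < 0.0:
        g_distill = g_distill - dot / float(g_task @ g_task) * g_task
    return g_task + g_distill


# Toy data: 5 joints with 8-dimensional intermediate features.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(5, 8))                   # stands in for high-quality skeleton features
student = teacher + 0.1 * rng.normal(size=(5, 8))   # stands in for low-quality skeleton features
loss = graph_matching_loss(student, teacher)

# Conflicting gradients: the opposing component of g_distill is removed.
combined = revise_gradient(np.array([1.0, 0.0]), np.array([-1.0, 1.0]))
```

In this toy setup the loss is zero when the student's features reproduce the teacher's joint-to-joint affinities, and the revised update never points against the task gradient, which is one way to read the stated balance between mimicking the teacher and improving the student's own accuracy.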