Song Sijie, Liu Jiaying, Li Yanghao, Guo Zongming
IEEE Trans Image Process. 2020 Jan 23. doi: 10.1109/TIP.2020.2967577.
With the prevalence of RGB-D cameras, multimodal video data have become more available for human action recognition. One main challenge for this task lies in how to effectively leverage their complementary information. In this work, we propose a Modality Compensation Network (MCN) to explore the relationships between different modalities and boost the representations for human action recognition. We regard RGB/optical flow videos as source modalities and skeletons as the auxiliary modality. Our goal is to extract more discriminative features from the source modalities with the help of the auxiliary modality. Built on deep Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, our model bridges data from the source and auxiliary modalities through a modality adaptation block to achieve adaptive representation learning, so that the network learns to compensate for the loss of skeletons at test time, and even at training time. We explore multiple adaptation schemes that narrow the distance between the source and auxiliary modal distributions at different levels, according to the alignment of source and auxiliary data during training. Note that skeletons are required only in the training phase; at test time, our model improves recognition performance using source data alone. Experimental results show that MCN outperforms state-of-the-art approaches on four widely used action recognition benchmarks.
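To make the adaptation idea concrete, the following is a minimal, hypothetical sketch of how source-modality (RGB/flow) features could be aligned with auxiliary-modality (skeleton) features during training, so that only source data is needed at test time. The linear-kernel MMD loss, layer sizes, and feature dimensions here are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ModalityAdaptationBlock(nn.Module):
    """Projects source-modality features into a space aligned with skeleton features (illustrative)."""
    def __init__(self, in_dim=2048, out_dim=256):
        super().__init__()
        self.project = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, src_feat):
        return self.project(src_feat)

def mmd_loss(x, y):
    """Simple linear-kernel MMD between two feature batches (one possible distribution-distance measure)."""
    return (x.mean(dim=0) - y.mean(dim=0)).pow(2).sum()

# Toy training step: per-clip features from a CNN on RGB/flow and from an LSTM on skeletons.
cnn_feat = torch.randn(8, 2048)   # source-modality features (e.g., CNN backbone output)
skel_feat = torch.randn(8, 256)   # auxiliary-modality features (e.g., skeleton LSTM output)

adapt = ModalityAdaptationBlock()
aligned = adapt(cnn_feat)
loss_align = mmd_loss(aligned, skel_feat)  # narrows the source/auxiliary distribution gap
# total_loss = classification_loss + lambda_align * loss_align  (weighting is an assumption)
```

At test time, under this sketch, only the CNN branch and the adaptation block would be run on source data; the skeleton branch and the alignment loss are used solely to shape the learned representation during training.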