Zhou Benjia, Wang Pichao, Wan Jun, Liang Yanyan, Wang Fan
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):11428-11442. doi: 10.1109/TPAMI.2023.3274783. Epub 2023 Sep 5.
Motion recognition is a promising direction in computer vision, but training video classification models is much harder than training image models because of insufficient data and the large number of parameters. To address this, some works strive to explore multimodal cues from RGB-D data. Although they improve motion recognition to some extent, these methods remain sub-optimal in the following aspects: (i) data augmentation, i.e., the scale of RGB-D datasets is still limited, and few efforts have been made to explore novel data augmentation strategies for videos; (ii) optimization mechanism, i.e., the tightly space-time-entangled network structure poses additional challenges to spatiotemporal information modeling; and (iii) cross-modal knowledge fusion, i.e., the high similarity between multimodal representations leads to insufficient late fusion. To alleviate these drawbacks, this article improves RGB-D-based motion recognition from both the data and the algorithm perspectives. Specifically, we first introduce a novel video data augmentation method dubbed ShuffleMix, which acts as a supplement to MixUp and provides additional temporal regularization for motion recognition. Second, a Unified Multimodal De-coupling and multi-stage Re-coupling framework, termed UMDR, is proposed for video representation learning. Finally, a novel cross-modal Complement Feature Catcher (CFCer) is explored to mine potential commonality features in multimodal information as an auxiliary fusion stream and thereby improve late fusion results. The seamless combination of these novel designs yields a robust spatiotemporal representation and achieves better performance than state-of-the-art methods on four public motion datasets. In particular, UMDR achieves an unprecedented improvement of ↑4.5% on the Chalearn IsoGD dataset.
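To make the MixUp-style video augmentation concrete, the PyTorch sketch below first shows vanilla MixUp on a batch of clips and then a hypothetical ShuffleMix-style variant in which the partner clip is shuffled at the temporal-segment level before mixing, which is one plausible way to inject the extra temporal regularization the abstract mentions. The segment-shuffling mechanics, the function names, and the `num_segments` parameter are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def mixup_video(x, y, alpha=0.8):
    """Standard MixUp on a batch of clips x: (B, C, T, H, W) with labels y: (B,)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    return x_mix, y, y[perm], lam

def shufflemix_video(x, y, alpha=0.8, num_segments=4):
    """Hypothetical ShuffleMix-style variant (assumption, not the paper's recipe):
    the partner clip is split into temporal segments whose order is shuffled
    before mixing, so the mixed clip also perturbs temporal structure."""
    B, C, T, H, W = x.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(B)
    partner = x[perm]
    # Split the partner clip along the time axis and shuffle the segment order.
    seg_len = max(T // num_segments, 1)
    segs = list(torch.split(partner, seg_len, dim=2))
    order = torch.randperm(len(segs))
    partner_shuffled = torch.cat([segs[i] for i in order], dim=2)
    x_mix = lam * x + (1.0 - lam) * partner_shuffled
    return x_mix, y, y[perm], lam
```

As with standard MixUp, the mixed batch would be trained against both label sets, e.g. `lam * ce(logits, y_a) + (1 - lam) * ce(logits, y_b)` with a cross-entropy criterion `ce`.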