Zhou Benjia, Wang Pichao, Wan Jun, Liang Yanyan, Wang Fan
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):11428-11442. doi: 10.1109/TPAMI.2023.3274783. Epub 2023 Sep 5.
Motion recognition is a promising direction in computer vision, but training video classification models is much harder than training image models because of insufficient data and the large number of parameters. To address this, some works strive to explore multimodal cues from RGB-D data. Although they improve motion recognition to some extent, these methods remain sub-optimal in the following aspects: (i) data augmentation, i.e., the scale of RGB-D datasets is still limited, and few efforts have been made to explore novel data augmentation strategies for videos; (ii) optimization mechanism, i.e., the tightly space-time-entangled network structure poses additional challenges to spatiotemporal information modeling; and (iii) cross-modal knowledge fusion, i.e., the high similarity between multimodal representations leads to insufficient late fusion. To alleviate these drawbacks, this article improves RGB-D-based motion recognition from both the data and the algorithm perspectives. Specifically, we first introduce a novel video data augmentation method dubbed ShuffleMix, which acts as a supplement to MixUp and provides additional temporal regularization for motion recognition. Second, a Unified Multimodal De-coupling and multi-stage Re-coupling framework, termed UMDR, is proposed for video representation learning. Finally, a novel cross-modal Complement Feature Catcher (CFCer) is explored to mine potential commonality features in multimodal information as an auxiliary fusion stream and thereby improve late fusion results. The seamless combination of these novel designs yields a robust spatiotemporal representation and achieves better performance than state-of-the-art methods on four public motion datasets. In particular, UMDR achieves an unprecedented improvement of ↑4.5% on the Chalearn IsoGD dataset.
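To make the MixUp-style video augmentation concrete, the PyTorch sketch below first shows vanilla MixUp on a batch of clips and then a hypothetical ShuffleMix-style variant in which the partner clip is shuffled at the temporal-segment level before mixing, which is one plausible way to inject the extra temporal regularization the abstract mentions. The segment-shuffling mechanics, the function names, and the `num_segments` parameter are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def mixup_video(x, y, alpha=0.8):
    """Standard MixUp on a batch of clips x: (B, C, T, H, W) with labels y: (B,)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    return x_mix, y, y[perm], lam

def shufflemix_video(x, y, alpha=0.8, num_segments=4):
    """Hypothetical ShuffleMix-style variant (assumption, not the paper's recipe):
    the partner clip is split into temporal segments whose order is shuffled
    before mixing, so the mixed clip also perturbs temporal structure."""
    B, C, T, H, W = x.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(B)
    partner = x[perm]
    # Split the partner clip along the time axis and shuffle the segment order.
    seg_len = max(T // num_segments, 1)
    segs = list(torch.split(partner, seg_len, dim=2))
    order = torch.randperm(len(segs))
    partner_shuffled = torch.cat([segs[i] for i in order], dim=2)
    x_mix = lam * x + (1.0 - lam) * partner_shuffled
    return x_mix, y, y[perm], lam
```

As with standard MixUp, the mixed batch would be trained against both label sets, e.g. `lam * ce(logits, y_a) + (1 - lam) * ce(logits, y_b)` with a cross-entropy criterion `ce`.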