Suppr 超能文献



MMNet: A Model-Based Multimodal Network for Human Action Recognition in RGB-D Videos.

Publication Info

IEEE Trans Pattern Anal Mach Intell. 2023 Mar;45(3):3522-3538. doi: 10.1109/TPAMI.2022.3177813. Epub 2023 Feb 3.

DOI: 10.1109/TPAMI.2022.3177813
PMID: 35617191
Abstract

Human action recognition (HAR) in RGB-D videos has been widely investigated since the release of affordable depth sensors. Currently, unimodal approaches (e.g., skeleton-based and RGB video-based) have realized substantial improvements with increasingly larger datasets. However, multimodal methods specifically with model-level fusion have seldom been investigated. In this article, we propose a model-based multimodal network (MMNet) that fuses skeleton and RGB modalities via a model-based approach. The objective of our method is to improve ensemble recognition accuracy by effectively applying mutually complementary information from different data modalities. For the model-based fusion scheme, we use a spatiotemporal graph convolution network for the skeleton modality to learn attention weights that will be transferred to the network of the RGB modality. Extensive experiments are conducted on five benchmark datasets: NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, Northwestern-UCLA Multiview, and Toyota Smarthome. Upon aggregating the results of multiple modalities, our method is found to outperform state-of-the-art approaches on six evaluation protocols of the five datasets; thus, the proposed MMNet can effectively capture mutually complementary features in different RGB-D video modalities and provide more discriminative features for HAR. We also tested our MMNet on an RGB video dataset Kinetics 400 that contains more outdoor actions, which shows consistent results with those of RGB-D video datasets.
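The model-level fusion described in the abstract (a skeleton-branch graph network learns attention weights that are transferred to the RGB branch, with modality scores ensembled at the end) can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the authors' implementation; the function and weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def model_based_fusion(skel_feat, rgb_feat, W_att, W_skel, W_rgb):
    """Sketch of MMNet-style model-level fusion (shapes are assumptions).

    skel_feat: (J, D) per-joint features from the skeleton ST-GCN branch
    rgb_feat:  (J, D) RGB features pooled around the same J body regions
    W_att:     (D, 1) projection producing per-region attention logits
    W_skel, W_rgb: (D, C) linear classifiers for C action classes
    """
    # Attention over body regions, learned from the skeleton modality
    # and transferred to reweight the RGB features.
    att = softmax(skel_feat @ W_att, axis=0)          # (J, 1)
    rgb_weighted = (att * rgb_feat).sum(axis=0)       # attended RGB descriptor, (D,)
    skel_desc = skel_feat.mean(axis=0)                # pooled skeleton descriptor, (D,)
    # Final prediction ensembles the two modality score vectors.
    score_skel = softmax(skel_desc @ W_skel)          # (C,)
    score_rgb = softmax(rgb_weighted @ W_rgb)         # (C,)
    return score_skel + score_rgb
```

The key design point reflected here is that the skeleton branch does not merely contribute its own score: its learned attention selects which RGB regions matter, which is what distinguishes model-level fusion from simple score averaging.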


Similar Articles

1
MMNet: A Model-Based Multimodal Network for Human Action Recognition in RGB-D Videos.
IEEE Trans Pattern Anal Mach Intell. 2023 Mar;45(3):3522-3538. doi: 10.1109/TPAMI.2022.3177813. Epub 2023 Feb 3.
2
Multi-scale and attention enhanced graph convolution network for skeleton-based violence action recognition.
Front Neurorobot. 2022 Dec 15;16:1091361. doi: 10.3389/fnbot.2022.1091361. eCollection 2022.
3
Enhancing Human Activity Recognition through Integrated Multimodal Analysis: A Focus on RGB Imaging, Skeletal Tracking, and Pose Estimation.
Sensors (Basel). 2024 Jul 17;24(14):4646. doi: 10.3390/s24144646.
4
Using a Selective Ensemble Support Vector Machine to Fuse Multimodal Features for Human Action Recognition.
Comput Intell Neurosci. 2022 Jan 10;2022:1877464. doi: 10.1155/2022/1877464. eCollection 2022.
5
Multiple/single-view human action recognition via part-induced multitask structural learning.
IEEE Trans Cybern. 2015 Jun;45(6):1194-208. doi: 10.1109/TCYB.2014.2347057. Epub 2014 Aug 27.
6
Learning with Privileged Information via Adversarial Discriminative Modality Distillation.
IEEE Trans Pattern Anal Mach Intell. 2020 Oct;42(10):2581-2593. doi: 10.1109/TPAMI.2019.2929038. Epub 2019 Jul 16.
7
Multi-Modality Adaptive Feature Fusion Graph Convolutional Network for Skeleton-Based Action Recognition.
Sensors (Basel). 2023 Jun 7;23(12):5414. doi: 10.3390/s23125414.
8
NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding.
IEEE Trans Pattern Anal Mach Intell. 2020 Oct;42(10):2684-2701. doi: 10.1109/TPAMI.2019.2916873. Epub 2019 May 14.
9
Whole and Part Adaptive Fusion Graph Convolutional Networks for Skeleton-Based Action Recognition.
Sensors (Basel). 2020 Dec 13;20(24):7149. doi: 10.3390/s20247149.
10
Human Action Recognition From Various Data Modalities: A Review.
IEEE Trans Pattern Anal Mach Intell. 2023 Mar;45(3):3200-3225. doi: 10.1109/TPAMI.2022.3183112. Epub 2023 Feb 3.

Cited By

1
A Comprehensive Methodological Survey of Human Activity Recognition Across Diverse Data Modalities.
Sensors (Basel). 2025 Jun 27;25(13):4028. doi: 10.3390/s25134028.
2
SignFormer-GCN: Continuous sign language translation using spatio-temporal graph convolutional networks.
PLoS One. 2025 Feb 14;20(2):e0316298. doi: 10.1371/journal.pone.0316298. eCollection 2025.
3
Multi-Level Feature Fusion in CNN-Based Human Action Recognition: A Case Study on EfficientNet-B7.
J Imaging. 2024 Dec 12;10(12):320. doi: 10.3390/jimaging10120320.
4
Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition.
Sensors (Basel). 2024 Nov 30;24(23):7682. doi: 10.3390/s24237682.
5
Depth Video-Based Secondary Action Recognition in Vehicles via Convolutional Neural Network and Bidirectional Long Short-Term Memory with Spatial Enhanced Attention Mechanism.
Sensors (Basel). 2024 Oct 13;24(20):6604. doi: 10.3390/s24206604.
6
Linguistic-Driven Partial Semantic Relevance Learning for Skeleton-Based Action Recognition.
Sensors (Basel). 2024 Jul 26;24(15):4860. doi: 10.3390/s24154860.