

MMNet: A Model-Based Multimodal Network for Human Action Recognition in RGB-D Videos.

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2023 Mar;45(3):3522-3538. doi: 10.1109/TPAMI.2022.3177813. Epub 2023 Feb 3.

Abstract

Human action recognition (HAR) in RGB-D videos has been widely investigated since the release of affordable depth sensors. Currently, unimodal approaches (e.g., skeleton-based and RGB video-based) have achieved substantial improvements on increasingly large datasets. However, multimodal methods, particularly those with model-level fusion, have seldom been investigated. In this article, we propose a model-based multimodal network (MMNet) that fuses the skeleton and RGB modalities via a model-based approach. The objective of our method is to improve ensemble recognition accuracy by effectively exploiting mutually complementary information from different data modalities. In the model-based fusion scheme, we use a spatiotemporal graph convolution network on the skeleton modality to learn attention weights that are then transferred to the network of the RGB modality. Extensive experiments are conducted on five benchmark datasets: NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, Northwestern-UCLA Multiview, and Toyota Smarthome. Upon aggregating the results of multiple modalities, our method outperforms state-of-the-art approaches on six evaluation protocols across the five datasets; thus, the proposed MMNet can effectively capture mutually complementary features in different RGB-D video modalities and provide more discriminative features for HAR. We also tested MMNet on the RGB video dataset Kinetics 400, which contains more outdoor actions; the results are consistent with those on the RGB-D video datasets.
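The core fusion idea described above, using skeleton-derived attention weights to reweight features from the RGB branch, can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's actual architecture: the feature shapes, the random stand-ins for both branches, and the channel-energy attention heuristic are all assumptions made for demonstration, and MMNet's real spatiotemporal GCN and RGB representations differ.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, J, C = 16, 25, 8  # frames, joints, channels (illustrative sizes)

# Hypothetical skeleton branch: a spatiotemporal GCN would produce
# per-joint activations; random features stand in for them here.
skeleton_feat = rng.normal(size=(T, J, C))

# Derive spatial attention from skeleton activations: mean channel
# energy per joint, normalised over the joints of each frame.
attention = softmax((skeleton_feat ** 2).mean(axis=-1), axis=-1)  # (T, J)

# Hypothetical RGB branch: per-joint appearance features (an assumption;
# MMNet's actual RGB representation may be organised differently).
rgb_feat = rng.normal(size=(T, J, C))

# Model-level fusion: reweight RGB features by skeleton-derived attention,
# then pool over time and joints into one clip-level descriptor.
fused = rgb_feat * attention[..., None]    # (T, J, C)
clip_descriptor = fused.mean(axis=(0, 1))  # (C,) feature for classification
```

The point of the sketch is the direction of information flow: attention is computed only from the skeleton branch and applied multiplicatively to the RGB branch, so the RGB network is guided toward the body regions the skeleton model found informative.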

