

AMS-Net: Modeling Adaptive Multi-Granularity Spatio-Temporal Cues for Video Action Recognition

Author Information

Wang Qilong, Hu Qiyao, Gao Zilin, Li Peihua, Hu Qinghua

Publication Information

IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):18731-18745. doi: 10.1109/TNNLS.2023.3321141. Epub 2024 Dec 2.

DOI: 10.1109/TNNLS.2023.3321141
PMID: 37824318
Abstract

Effective spatio-temporal modeling, as the core of video representation learning, is challenged by complex scale variations in the spatio-temporal cues of videos, especially the different visual tempos of actions and the varying spatial sizes of moving objects. Most existing works handle complex spatio-temporal scale variations with input-level or feature-level pyramid mechanisms, which, however, rely on expensive multistream architectures or explore multiscale spatio-temporal features in a fixed manner. To capture the complex scale dynamics of spatio-temporal cues both effectively and efficiently, this article proposes a single-stream, single-input architecture, the adaptive multi-granularity spatio-temporal network (AMS-Net), which models adaptive multi-granularity spatio-temporal cues for video action recognition. AMS-Net introduces two core components: a competitive progressive temporal modeling (CPTM) block and a collaborative spatio-temporal pyramid (CSTP) module. They respectively capture fine-grained temporal cues and fuse coarse-level spatio-temporal features in an adaptive manner, allowing AMS-Net to handle both subtle variations in visual tempo and fair-sized spatio-temporal dynamics within a unified architecture. AMS-Net can be flexibly instantiated on top of existing deep convolutional neural networks (CNNs) using the proposed CPTM block and CSTP module. Experiments on eight video benchmarks show that AMS-Net establishes state-of-the-art (SOTA) performance on fine-grained action recognition (Diving48 and FineGym), while performing very competitively on the widely used Something-Something and Kinetics.

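The abstract's central idea — fusing spatio-temporal features pooled at several temporal granularities, with weights that decide how much each granularity contributes — can be illustrated with a toy NumPy sketch. This is not the authors' CPTM/CSTP implementation: the function name, the fixed pooling windows, and the externally supplied fusion logits are all illustrative assumptions (in the real model the weights would be predicted from the input, which is what makes the fusion "adaptive").

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_granularity_fusion(feats, granularities=(1, 2, 4), logits=None):
    """Toy pyramid-style multi-granularity temporal fusion (illustrative only).

    feats: (T, C) array of per-frame features.
    granularities: temporal pooling window sizes, fine to coarse.
    logits: unnormalized fusion weights; a real adaptive model would
            predict these from the input, here they default to uniform.
    Returns a fused (T, C) feature array.
    """
    T, _ = feats.shape
    if logits is None:
        logits = np.zeros(len(granularities))
    weights = softmax(np.asarray(logits, dtype=float))
    fused = np.zeros_like(feats, dtype=float)
    for w, g in zip(weights, granularities):
        # average-pool over non-overlapping windows of g frames,
        # then broadcast each window mean back to its frames
        pooled = np.empty_like(feats, dtype=float)
        for start in range(0, T, g):
            seg = feats[start:start + g]
            pooled[start:start + g] = seg.mean(axis=0, keepdims=True)
        fused += w * pooled
    return fused
```

With a single granularity of 1 the fusion is an identity; coarser windows progressively smooth the temporal signal, which is the intuition behind matching both fast and slow visual tempos in one stream.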

Similar Articles

1. AMS-Net: Modeling Adaptive Multi-Granularity Spatio-Temporal Cues for Video Action Recognition. IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):18731-18745. doi: 10.1109/TNNLS.2023.3321141. Epub 2024 Dec 2.
2. Spatial-Temporal Pyramid Graph Reasoning for Action Recognition. IEEE Trans Image Process. 2022;31:5484-5497. doi: 10.1109/TIP.2022.3196175. Epub 2022 Aug 22.
3. Motion-Driven Visual Tempo Learning for Video-Based Action Recognition. IEEE Trans Image Process. 2022;31:4104-4116. doi: 10.1109/TIP.2022.3180585. Epub 2022 Jun 20.
4. A Spatio-Temporal Motion Network for Action Recognition Based on Spatial Attention. Entropy (Basel). 2022 Mar 4;24(3):368. doi: 10.3390/e24030368.
5. Fine-Grained Video Captioning via Graph-based Multi-Granularity Interaction Learning. IEEE Trans Pattern Anal Mach Intell. 2022 Feb;44(2):666-683. doi: 10.1109/TPAMI.2019.2946823. Epub 2022 Jan 7.
6. Skeleton-Based Spatio-Temporal U-Network for 3D Human Pose Estimation in Video. Sensors (Basel). 2022 Mar 28;22(7):2573. doi: 10.3390/s22072573.
7. Multi-Scale Spatio-Temporal Memory Network for Lightweight Video Denoising. IEEE Trans Image Process. 2024;33:5810-5823. doi: 10.1109/TIP.2024.3444315. Epub 2024 Oct 15.
8. Gate-Shift-Fuse for Video Action Recognition. IEEE Trans Pattern Anal Mach Intell. 2023 Sep;45(9):10913-10928. doi: 10.1109/TPAMI.2023.3268134. Epub 2023 Aug 7.
9. Research on Multi-Scale Spatio-Temporal Graph Convolutional Human Behavior Recognition Method Incorporating Multi-Granularity Features. Sensors (Basel). 2024 Nov 28;24(23):7595. doi: 10.3390/s24237595.
10. MEST: An Action Recognition Network with Motion Encoder and Spatio-Temporal Module. Sensors (Basel). 2022 Sep 1;22(17):6595. doi: 10.3390/s22176595.