
AMS-Net: Modeling Adaptive Multi-Granularity Spatio-Temporal Cues for Video Action Recognition.

Author Information

Wang Qilong, Hu Qiyao, Gao Zilin, Li Peihua, Hu Qinghua

Publication Information

IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):18731-18745. doi: 10.1109/TNNLS.2023.3321141. Epub 2024 Dec 2.

Abstract

Effective spatio-temporal modeling, a core of video representation learning, is challenged by the complex scale variations of spatio-temporal cues in videos, especially the different visual tempos of actions and the varying spatial sizes of moving objects. Most existing works handle these complex scale variations with input-level or feature-level pyramid mechanisms, which, however, rely on expensive multistream architectures or explore multiscale spatio-temporal features in a fixed manner. To capture the complex scale dynamics of spatio-temporal cues both effectively and efficiently, this article proposes a single-stream architecture (SS-Arch.) with a single input, namely the adaptive multi-granularity spatio-temporal network (AMS-Net), which models adaptive multi-granularity (Multi-Gran.) spatio-temporal cues for video action recognition. To this end, AMS-Net introduces two core components: a competitive progressive temporal modeling (CPTM) block and a collaborative spatio-temporal pyramid (CSTP) module. They respectively capture fine-grained temporal cues and adaptively fuse coarse-level spatio-temporal features. This allows AMS-Net to handle subtle variations in visual tempos and fair-sized spatio-temporal dynamics within a unified architecture. Note that AMS-Net can be flexibly instantiated on top of existing deep convolutional neural networks (CNNs) using the proposed CPTM block and CSTP module. Experiments are conducted on eight video benchmarks, and the results show that AMS-Net establishes state-of-the-art (SOTA) performance on fine-grained action recognition (i.e., Diving48 and FineGym) while performing very competitively on the widely used Something-Something and Kinetics benchmarks.
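To make the described architecture more concrete, below is a minimal, purely illustrative PyTorch sketch of how a temporal-modeling block and a spatio-temporal pyramid module could be attached to a 2D/3D CNN backbone, in the spirit of the CPTM block and CSTP module. The abstract gives no implementation details, so the module designs, kernel sizes, dilation rates, pooling scales, and fusion scheme below are all assumptions, not the authors' method.

```python
# Illustrative sketch only: hypothetical stand-ins for a CPTM-style block and a
# CSTP-style module as described at a high level in the abstract. All design
# choices (dilations, pooling scales, softmax fusion) are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalModelingBlock(nn.Module):
    """Captures fine-grained temporal cues with depthwise temporal convolutions
    at several dilation rates and fuses them with softmax weights, i.e. a
    simple competitive combination (assumed, not the paper's CPTM design)."""

    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                      padding=(d, 0, 0), dilation=(d, 1, 1),
                      groups=channels, bias=False)
            for d in dilations
        )
        # One logit per branch; softmax makes the temporal branches compete.
        self.branch_logits = nn.Parameter(torch.zeros(len(dilations)))

    def forward(self, x):  # x: (B, C, T, H, W)
        weights = torch.softmax(self.branch_logits, dim=0)
        out = sum(w * branch(x) for w, branch in zip(weights, self.branches))
        return x + out  # residual connection


class SpatioTemporalPyramid(nn.Module):
    """Pools features at several spatio-temporal scales and fuses them back
    with a 1x1x1 projection (assumed, not the paper's CSTP design)."""

    def __init__(self, channels: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.project = nn.Conv3d(channels * len(scales), channels, kernel_size=1)

    def forward(self, x):  # x: (B, C, T, H, W)
        _, _, t, h, w = x.shape
        levels = []
        for s in self.scales:
            pooled = F.adaptive_avg_pool3d(
                x, (max(t // s, 1), max(h // s, 1), max(w // s, 1)))
            levels.append(F.interpolate(pooled, size=(t, h, w),
                                        mode="trilinear", align_corners=False))
        return x + self.project(torch.cat(levels, dim=1))


if __name__ == "__main__":
    video_feats = torch.randn(2, 64, 8, 56, 56)  # (batch, channels, frames, H, W)
    feats = TemporalModelingBlock(64)(video_feats)
    feats = SpatioTemporalPyramid(64)(feats)
    print(feats.shape)  # torch.Size([2, 64, 8, 56, 56])
```

In this sketch the two modules preserve the feature shape, so they could be dropped between stages of an existing CNN backbone, which matches the abstract's claim that AMS-Net can be flexibly instantiated on existing CNNs.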

