

Non-Local Temporal Difference Network for Temporal Action Detection.

Affiliations

Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu 610081, China.

School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China.

Publication information

Sensors (Basel). 2022 Nov 1;22(21):8396. doi: 10.3390/s22218396.

Abstract

As an important part of video understanding, temporal action detection (TAD) has wide application scenarios. It aims to simultaneously predict the boundary position and class label of every action instance in an untrimmed video. Most existing temporal action detection methods adopt a stacked convolutional block strategy to model long temporal structures. However, most of the information between adjacent frames is redundant, and distant information is weakened after multiple convolution operations. In addition, the durations of action instances vary widely, making it difficult for single-scale modeling to fit complex video structures. To address these issues, we propose a non-local temporal difference network (NTD), consisting of a chunk convolution (CC) module, a multiple temporal coordination (MTC) module, and a temporal difference (TD) module. The TD module adaptively enhances motion information and boundary features with temporal attention weights. The CC module evenly divides the input sequence into N chunks and uses multiple independent convolution blocks to extract features from neighboring chunks simultaneously, so information is delivered from distant frames without being trapped in purely local convolutions. The MTC module uses a cascade residual architecture that achieves multiscale temporal feature aggregation without introducing additional parameters. NTD achieves state-of-the-art performance on two large-scale datasets, with 36.2% mAP@avg on ActivityNet-v1.3 and 71.6% mAP@0.5 on THUMOS-14.
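The abstract only outlines the three modules, so the sketch below is a rough, illustrative reading of them in PyTorch rather than the authors' implementation. All class and parameter names (TemporalDifference, ChunkConv, MultiTemporalCoordination, num_chunks, scales) are assumptions introduced here for illustration.

```python
# Illustrative sketch only (not the authors' released code) of the TD, CC, and
# MTC modules as described in the abstract. Names and design details are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalDifference(nn.Module):
    """TD module: derives temporal attention weights from adjacent-frame feature
    differences to emphasize motion and boundary information."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        diff = x[:, :, 1:] - x[:, :, :-1]          # frame-to-frame differences
        diff = F.pad(diff, (1, 0))                 # restore temporal length
        attn = torch.sigmoid(self.proj(diff))      # temporal attention weights
        return x + x * attn                        # residual enhancement


class ChunkConv(nn.Module):
    """CC module: splits the sequence into N chunks, each handled by an
    independent convolution block, so distant frames contribute without
    relying on deep stacks of purely local convolutions."""

    def __init__(self, channels: int, num_chunks: int = 4):
        super().__init__()
        self.num_chunks = num_chunks
        self.blocks = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=3, padding=1)
             for _ in range(num_chunks)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); time assumed divisible by num_chunks
        chunks = torch.chunk(x, self.num_chunks, dim=2)
        return torch.cat([blk(c) for blk, c in zip(self.blocks, chunks)], dim=2)


class MultiTemporalCoordination(nn.Module):
    """MTC module: cascade residual aggregation over several temporal scales
    using parameter-free pooling, so no additional weights are introduced."""

    def __init__(self, scales=(2, 4, 8)):
        super().__init__()
        self.scales = scales

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = x
        for s in self.scales:
            pooled = F.avg_pool1d(feat, kernel_size=s, stride=s)
            up = F.interpolate(pooled, size=feat.shape[-1],
                               mode="linear", align_corners=False)
            feat = feat + up                       # cascade residual across scales
        return feat


if __name__ == "__main__":
    feats = torch.randn(2, 256, 128)               # (batch, channels, frames)
    feats = TemporalDifference(256)(feats)
    feats = ChunkConv(256, num_chunks=4)(feats)
    feats = MultiTemporalCoordination()(feats)
    print(feats.shape)                             # torch.Size([2, 256, 128])
```

The pooling-based MTC stage here is chosen simply because average pooling and interpolation add no learnable parameters, matching the abstract's claim of multiscale aggregation without extra parameters; the paper's actual cascade residual design may differ.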


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f821/9655564/b9b8052c5d3a/sensors-22-08396-g001.jpg
