

Sequential Video VLAD: Training the Aggregation Locally and Temporally.

Publication Information

IEEE Trans Image Process. 2018 Oct;27(10):4933-4944. doi: 10.1109/TIP.2018.2846664.

DOI: 10.1109/TIP.2018.2846664
PMID: 29985134
Abstract

As characterizing videos simultaneously from spatial and temporal cues has been shown to be crucial for video analysis, the combination of convolutional neural networks and recurrent neural networks, i.e., recurrent convolutional networks (RCNs), is a natural framework for learning spatio-temporal video features. In this paper, we develop a novel sequential vector of locally aggregated descriptors (VLAD) layer, named SeqVLAD, which combines a trainable VLAD encoding process and the RCN architecture into a single framework. In particular, sequential convolutional feature maps extracted from successive video frames are fed into the RCNs to learn soft spatio-temporal assignment parameters, so as to aggregate not only the detailed spatial information in individual video frames but also the fine motion information across successive frames. Moreover, we improve the gated recurrent unit (GRU) of RCNs by sharing the input-to-hidden parameters, and propose an improved GRU-RCN architecture named shared GRU-RCN (SGRU-RCN). Our SGRU-RCN thus has fewer parameters and a lower risk of overfitting. In experiments, we evaluate SeqVLAD on video captioning and video action recognition. Experimental results on the Microsoft Research Video Description Corpus, the Montreal Video Annotation Dataset, UCF101, and HMDB51 demonstrate the effectiveness and good performance of our method.
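The core mechanism the abstract describes — soft assignment of local descriptors to cluster centers, with residuals accumulated over successive frames — can be illustrated with a minimal numpy sketch. This is not the paper's trainable layer: in SeqVLAD the assignments are produced by the recurrent network, whereas here they are computed from distances to fixed centers; the sharpness parameter `alpha` and the two-stage normalization are common VLAD conventions assumed for illustration.

```python
import numpy as np

def seq_soft_vlad(frames, centers, alpha=10.0):
    """Soft-assignment VLAD aggregated over a frame sequence.

    frames:  (T, N, D) array - T timesteps, N local descriptors of dim D per frame
    centers: (K, D) array    - K cluster centers
    Returns a (K*D,) L2-normalized VLAD vector.
    """
    T, N, D = frames.shape
    K = centers.shape[0]
    vlad = np.zeros((K, D))
    for t in range(T):
        x = frames[t]                                               # (N, D)
        # soft assignments: softmax over negative squared distances
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (N, K)
        logits = -alpha * d2
        logits -= logits.max(axis=1, keepdims=True)                 # numerical stability
        a = np.exp(logits)
        a /= a.sum(axis=1, keepdims=True)                           # (N, K)
        # accumulate assignment-weighted residuals across all timesteps
        resid = x[:, None, :] - centers[None, :, :]                 # (N, K, D)
        vlad += (a[:, :, None] * resid).sum(axis=0)                 # (K, D)
    # intra-normalize per center, then global L2 normalization
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    v = vlad.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```

Making the assignments `a` a function of a recurrent hidden state, rather than of raw distances, is what lets the full model fold motion information from preceding frames into the aggregation.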


Similar Articles

1. Sequential Video VLAD: Training the Aggregation Locally and Temporally.
IEEE Trans Image Process. 2018 Oct;27(10):4933-4944. doi: 10.1109/TIP.2018.2846664.
2. Action-Stage Emphasized Spatio-Temporal VLAD for Video Action Recognition.
IEEE Trans Image Process. 2019 Jan 3. doi: 10.1109/TIP.2018.2890749.
3. Spatiotemporal Interaction Residual Networks with Pseudo3D for Video Action Recognition.
Sensors (Basel). 2020 Jun 1;20(11):3126. doi: 10.3390/s20113126.
4. Video Captioning with Object-Aware Spatio-Temporal Correlation and Aggregation.
IEEE Trans Image Process. 2020 Apr 27. doi: 10.1109/TIP.2020.2988435.
5. Human Activity Recognition Using Cascaded Dual Attention CNN and Bi-Directional GRU Framework.
J Imaging. 2023 Jun 26;9(7):130. doi: 10.3390/jimaging9070130.
6. Exploiting Images for Video Recognition: Heterogeneous Feature Augmentation via Symmetric Adversarial Learning.
IEEE Trans Image Process. 2019 Nov;28(11):5308-5321. doi: 10.1109/TIP.2019.2917867. Epub 2019 May 24.
7. Deep Manifold Learning Combined With Convolutional Neural Networks for Action Recognition.
IEEE Trans Neural Netw Learn Syst. 2018 Sep;29(9):3938-3952. doi: 10.1109/TNNLS.2017.2740318. Epub 2017 Sep 15.
8. Video Super-Resolution via Bidirectional Recurrent Convolutional Networks.
IEEE Trans Pattern Anal Mach Intell. 2018 Apr;40(4):1015-1028. doi: 10.1109/TPAMI.2017.2701380. Epub 2017 May 4.
9. Adaptive detrending to accelerate convolutional gated recurrent unit training for contextual video recognition.
Neural Netw. 2018 Sep;105:356-370. doi: 10.1016/j.neunet.2018.05.009. Epub 2018 May 22.
10. Self-Supervised Learning to Detect Key Frames in Videos.
Sensors (Basel). 2020 Dec 4;20(23):6941. doi: 10.3390/s20236941.

Cited By

1. Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization.
Sci Rep. 2024 Oct 31;14(1):26202. doi: 10.1038/s41598-024-75640-6.
2. Intelligent Video Analytics for Human Action Recognition: The State of Knowledge.
Sensors (Basel). 2023 Apr 25;23(9):4258. doi: 10.3390/s23094258.
3. Medical Image Captioning Using Optimized Deep Learning Model.
Comput Intell Neurosci. 2022 Mar 9;2022:9638438. doi: 10.1155/2022/9638438. eCollection 2022.
4. Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) Model for Human Action Recognition.
Sensors (Basel). 2019 Jun 21;19(12):2790. doi: 10.3390/s19122790.
5. Feature Fusion of Deep Spatial Features and Handcrafted Spatiotemporal Features for Human Action Recognition.
Sensors (Basel). 2019 Apr 2;19(7):1599. doi: 10.3390/s19071599.
6. Two-Way Affective Modeling for Hidden Movie Highlights' Extraction.
Sensors (Basel). 2018 Dec 3;18(12):4241. doi: 10.3390/s18124241.
7. Exploring the Consequences of Crowd Compression Through Physics-Based Simulation.
Sensors (Basel). 2018 Nov 27;18(12):4149. doi: 10.3390/s18124149.
8. OPTICS-based Unsupervised Method for Flaking Degree Evaluation on the Murals in Mogao Grottoes.
Sci Rep. 2018 Oct 29;8(1):15954. doi: 10.1038/s41598-018-34317-7.