

Sequential Video VLAD: Training the Aggregation Locally and Temporally.

Publication Information

IEEE Trans Image Process. 2018 Oct;27(10):4933-4944. doi: 10.1109/TIP.2018.2846664.

DOI: 10.1109/TIP.2018.2846664
PMID: 29985134
Abstract

As characterizing videos simultaneously from spatial and temporal cues has been shown to be crucial for video analysis, the combination of convolutional neural networks and recurrent neural networks, i.e., recurrent convolutional networks (RCNs), is a natural framework for learning spatio-temporal video features. In this paper, we develop a novel sequential vector of locally aggregated descriptors (VLAD) layer, named SeqVLAD, which combines a trainable VLAD encoding process and the RCN architecture into a single framework. In particular, sequential convolutional feature maps extracted from successive video frames are fed into the RCNs to learn soft spatio-temporal assignment parameters, so as to aggregate not only the detailed spatial information in individual video frames but also the fine motion information across successive frames. Moreover, we improve the gated recurrent unit (GRU) of RCNs by sharing the input-to-hidden parameters, and propose an improved GRU-RCN architecture named shared GRU-RCN (SGRU-RCN). Our SGRU-RCN thus has fewer parameters and a lower risk of overfitting. In experiments, we evaluate SeqVLAD on video captioning and video action recognition. Experimental results on the Microsoft Research Video Description Corpus, the Montreal Video Annotation Dataset, UCF101, and HMDB51 demonstrate the effectiveness and good performance of our method.
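The core mechanism the abstract describes — soft assignment of local descriptors to cluster centers, with residuals accumulated over successive frames — can be illustrated with a minimal numpy sketch. This is not the paper's trainable layer: in SeqVLAD the assignments are produced by the recurrent network, whereas here they are computed from distances to fixed centers; the sharpness parameter `alpha` and the two-stage normalization are common VLAD conventions assumed for illustration.

```python
import numpy as np

def seq_soft_vlad(frames, centers, alpha=10.0):
    """Soft-assignment VLAD aggregated over a frame sequence.

    frames:  (T, N, D) array - T timesteps, N local descriptors of dim D per frame
    centers: (K, D) array    - K cluster centers
    Returns a (K*D,) L2-normalized VLAD vector.
    """
    T, N, D = frames.shape
    K = centers.shape[0]
    vlad = np.zeros((K, D))
    for t in range(T):
        x = frames[t]                                               # (N, D)
        # soft assignments: softmax over negative squared distances
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (N, K)
        logits = -alpha * d2
        logits -= logits.max(axis=1, keepdims=True)                 # numerical stability
        a = np.exp(logits)
        a /= a.sum(axis=1, keepdims=True)                           # (N, K)
        # accumulate assignment-weighted residuals across all timesteps
        resid = x[:, None, :] - centers[None, :, :]                 # (N, K, D)
        vlad += (a[:, :, None] * resid).sum(axis=0)                 # (K, D)
    # intra-normalize per center, then global L2 normalization
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    v = vlad.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```

Making the assignments `a` a function of a recurrent hidden state, rather than of raw distances, is what lets the full model fold motion information from preceding frames into the aggregation.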


Similar Articles

1. Sequential Video VLAD: Training the Aggregation Locally and Temporally.
IEEE Trans Image Process. 2018 Oct;27(10):4933-4944. doi: 10.1109/TIP.2018.2846664.
2. Action-Stage Emphasized Spatio-Temporal VLAD for Video Action Recognition.
IEEE Trans Image Process. 2019 Jan 3. doi: 10.1109/TIP.2018.2890749.
3. Spatiotemporal Interaction Residual Networks with Pseudo3D for Video Action Recognition.
Sensors (Basel). 2020 Jun 1;20(11):3126. doi: 10.3390/s20113126.
4. Video Captioning with Object-Aware Spatio-Temporal Correlation and Aggregation.
IEEE Trans Image Process. 2020 Apr 27. doi: 10.1109/TIP.2020.2988435.
5. Human Activity Recognition Using Cascaded Dual Attention CNN and Bi-Directional GRU Framework.
J Imaging. 2023 Jun 26;9(7):130. doi: 10.3390/jimaging9070130.
6. Exploiting Images for Video Recognition: Heterogeneous Feature Augmentation via Symmetric Adversarial Learning.
IEEE Trans Image Process. 2019 Nov;28(11):5308-5321. doi: 10.1109/TIP.2019.2917867. Epub 2019 May 24.
7. Deep Manifold Learning Combined With Convolutional Neural Networks for Action Recognition.
IEEE Trans Neural Netw Learn Syst. 2018 Sep;29(9):3938-3952. doi: 10.1109/TNNLS.2017.2740318. Epub 2017 Sep 15.
8. Video Super-Resolution via Bidirectional Recurrent Convolutional Networks.
IEEE Trans Pattern Anal Mach Intell. 2018 Apr;40(4):1015-1028. doi: 10.1109/TPAMI.2017.2701380. Epub 2017 May 4.
9. Adaptive detrending to accelerate convolutional gated recurrent unit training for contextual video recognition.
Neural Netw. 2018 Sep;105:356-370. doi: 10.1016/j.neunet.2018.05.009. Epub 2018 May 22.
10. Self-Supervised Learning to Detect Key Frames in Videos.
Sensors (Basel). 2020 Dec 4;20(23):6941. doi: 10.3390/s20236941.

Cited By

1. Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization.
Sci Rep. 2024 Oct 31;14(1):26202. doi: 10.1038/s41598-024-75640-6.
2. Intelligent Video Analytics for Human Action Recognition: The State of Knowledge.
Sensors (Basel). 2023 Apr 25;23(9):4258. doi: 10.3390/s23094258.
3. Medical Image Captioning Using Optimized Deep Learning Model.
Comput Intell Neurosci. 2022 Mar 9;2022:9638438. doi: 10.1155/2022/9638438. eCollection 2022.
4. Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) Model for Human Action Recognition.
Sensors (Basel). 2019 Jun 21;19(12):2790. doi: 10.3390/s19122790.
5. Feature Fusion of Deep Spatial Features and Handcrafted Spatiotemporal Features for Human Action Recognition.
Sensors (Basel). 2019 Apr 2;19(7):1599. doi: 10.3390/s19071599.
6. Two-Way Affective Modeling for Hidden Movie Highlights' Extraction.
Sensors (Basel). 2018 Dec 3;18(12):4241. doi: 10.3390/s18124241.
7. Exploring the Consequences of Crowd Compression Through Physics-Based Simulation.
Sensors (Basel). 2018 Nov 27;18(12):4149. doi: 10.3390/s18124149.
8. OPTICS-based Unsupervised Method for Flaking Degree Evaluation on the Murals in Mogao Grottoes.
Sci Rep. 2018 Oct 29;8(1):15954. doi: 10.1038/s41598-018-34317-7.