Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an 710049, China.
HERE Technologies, Chicago, IL 60606, USA.
Sensors (Basel). 2018 Jun 21;18(7):1979. doi: 10.3390/s18071979.
Research in human action recognition has accelerated significantly since the introduction of powerful machine learning tools such as Convolutional Neural Networks (CNNs). However, effective and efficient methods for incorporating temporal information into CNNs are still being actively explored in the recent literature. Motivated by the popular recurrent attention models in the research area of natural language processing, we propose the Attention-aware Temporal Weighted CNN (ATW CNN) for action recognition in videos, which embeds a visual attention model into a temporal weighted multi-stream CNN. This attention model is implemented simply as temporal weighting, yet it effectively boosts the recognition performance of video representations. Moreover, each stream in the proposed ATW CNN framework is capable of end-to-end training, with both network parameters and temporal weights optimized by stochastic gradient descent (SGD) with back-propagation. Our experimental results on the UCF-101 and HMDB-51 datasets show that the proposed attention mechanism contributes substantially to the performance gains by assigning higher weights to the more discriminative snippets, thereby focusing on the more relevant video segments.
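The core idea of temporal weighting described above can be sketched as a softmax-normalized set of learnable weights that fuse per-snippet class scores into a video-level prediction. The following is a minimal illustrative sketch, not the authors' implementation; the function and variable names (`temporal_weighted_fusion`, `attention_logits`) are hypothetical, and in the actual framework the logits would be learned jointly with the network parameters via SGD.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def temporal_weighted_fusion(snippet_scores, attention_logits):
    """Fuse per-snippet class scores into video-level scores.

    snippet_scores:   (K, C) array, class scores for K snippets, C classes.
    attention_logits: (K,) learnable temporal attention logits (hypothetical
                      name; in the paper these would be trained end-to-end).
    Returns a (C,) array of temporally weighted video-level scores.
    """
    w = softmax(attention_logits)  # temporal attention weights, sum to 1
    return w @ snippet_scores      # weighted sum over the K snippets

# Toy example: K=3 snippets, C=2 classes.
scores = np.array([[0.2, 0.8],
                   [0.6, 0.4],
                   [0.9, 0.1]])
logits = np.array([2.0, 0.5, 0.1])  # a discriminative snippet gets a high weight
video_scores = temporal_weighted_fusion(scores, logits)
```

Because the softmax weights sum to one, the fused scores stay on the same scale as the per-snippet scores; snippets with larger logits dominate the video-level prediction, which is the intended "focus on relevant segments" behavior.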