Hao Yang, Chunfeng Yuan, Li Zhang, Yunda Sun, Weiming Hu, Stephen J. Maybank
IEEE Trans Image Process. 2020 Apr 7. doi: 10.1109/TIP.2020.2984904.
Convolutional Neural Networks have achieved excellent success in object recognition in still images. However, their improvement over traditional methods for recognizing actions in videos is less significant, because raw videos usually contain much more redundant or irrelevant information than still images. In this paper, we propose a Spatial-Temporal Attentive Convolutional Neural Network (STA-CNN) that automatically selects discriminative temporal segments and focuses on informative spatial regions. The STA-CNN model incorporates a Temporal Attention Mechanism and a Spatial Attention Mechanism into a unified convolutional network to recognize actions in videos. The novel Temporal Attention Mechanism automatically mines discriminative temporal segments from long and noisy videos. The Spatial Attention Mechanism first exploits the instantaneous motion information in optical flow features to locate motion-salient regions, and is then trained with an auxiliary classification loss and a Global Average Pooling layer to focus on discriminative non-motion regions in the video frame. The STA-CNN model achieves state-of-the-art performance on two of the most challenging datasets, UCF-101 (95.8%) and HMDB-51 (71.5%).
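The core idea of temporal attention described above (scoring each temporal segment and pooling features with the resulting weights) can be illustrated with a minimal sketch. This is not the paper's exact architecture; the dot-product scoring function and toy feature values below are purely illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def temporal_attention_pool(segment_features, score_weights):
    """Pool per-segment features into a single clip-level feature.

    Each segment receives a scalar score (here a simple dot product
    with a learned weight vector, a stand-in for the paper's learned
    scoring network); a softmax turns scores into attention weights,
    so discriminative segments dominate the pooled representation.
    """
    scores = [sum(w * f for w, f in zip(score_weights, feat))
              for feat in segment_features]
    attn = softmax(scores)
    dim = len(segment_features[0])
    pooled = [sum(a * feat[d] for a, feat in zip(attn, segment_features))
              for d in range(dim)]
    return pooled, attn

# Three temporal segments with toy 4-D features
feats = [[0.1, 0.9, 0.0, 0.2],
         [0.8, 0.1, 0.3, 0.0],
         [0.2, 0.2, 0.7, 0.1]]
pooled, attn = temporal_attention_pool(feats, [1.0, 0.5, 0.5, 0.0])
print([round(a, 3) for a in attn])
```

Segments with higher scores contribute more to the pooled clip feature, which is the behavior the Temporal Attention Mechanism learns end-to-end from long, noisy videos.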