IEEE Trans Neural Netw Learn Syst. 2021 Jan;32(1):334-347. doi: 10.1109/TNNLS.2020.2978613. Epub 2021 Jan 4.
Convolutional neural networks (CNNs) have proven effective at learning spatiotemporal representations for action recognition in videos. However, most existing action recognition algorithms do not employ an attention mechanism to focus on the parts of video frames that are relevant to the action. In this article, we propose a novel global and local knowledge-aware attention network to address this challenge. The proposed network incorporates two types of attention mechanism, statistic-based attention (SA) and learning-based attention (LA), to attach higher importance to the crucial elements in each video frame. Because global pooling (GP) models capture global information while attention models focus on significant details, our network adopts a three-stream architecture, comprising two attention streams and a GP stream, to exploit their implicit complementary advantages. Each attention stream employs a fusion layer to combine global and local information and produce composite features. Furthermore, global-attention (GA) regularization is proposed to guide the two attention streams to better model the dynamics of composite features with reference to the global information. Fusion at the softmax layer is adopted to make better use of the implicit complementary advantages among the SA, LA, and GP streams and to obtain the final comprehensive predictions. The proposed network is trained in an end-to-end fashion and learns efficient video-level features both spatially and temporally. Extensive experiments on three challenging benchmarks, Kinetics, HMDB51, and UCF101, demonstrate that the proposed network outperforms most state-of-the-art methods.
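To make the three-stream design concrete, the following is a minimal NumPy sketch of the overall flow: a global-pooling stream, two attention-weighted streams, per-stream fusion of global and local features, and fusion of class probabilities at the softmax layer. All shapes, the choice of per-frame L2 norm as the "statistic" for SA, the additive fusion layer, and the shared classifier weights are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: T frames, D-dim CNN features per frame, C action classes.
rng = np.random.default_rng(0)
T, D, C = 8, 16, 5
feats = rng.standard_normal((T, D))   # per-frame CNN features (assumed given)
w_att = rng.standard_normal(D)        # learning-based attention parameters (assumed)
W_cls = rng.standard_normal((D, C))   # classifier weights shared across streams (assumed)

# GP stream: uniform average pooling over frames captures global information.
gp_feat = feats.mean(axis=0)

# LA stream: learned, softmax-normalized weights over frames.
la_w = softmax(feats @ w_att)         # (T,) attention distribution over frames
la_feat = la_w @ feats                # attention-weighted sum of frame features

# SA stream: weights derived from a feature statistic (per-frame L2 norm here,
# an illustrative stand-in for the paper's statistic-based attention).
sa_w = softmax(np.linalg.norm(feats, axis=1))
sa_feat = sa_w @ feats

# Each attention stream fuses global and local information into composite
# features (simple additive fusion here, as an assumption).
la_fused = la_feat + gp_feat
sa_fused = sa_feat + gp_feat

# Fusion at the softmax layer: average the per-stream class probabilities.
probs = (softmax(gp_feat @ W_cls)
         + softmax(la_fused @ W_cls)
         + softmax(sa_fused @ W_cls)) / 3.0
print("predicted class:", probs.argmax())
```

In this sketch the GA regularization term is omitted; in the paper it would additionally penalize attention streams whose composite features drift from the global reference during training.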