School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China.
Key Laboratory of Big Data and Intelligent Robot, Ministry of Education, Guangzhou 510006, China.
Sensors (Basel). 2021 Feb 28;21(5):1656. doi: 10.3390/s21051656.
Video-based human action recognition with deep neural networks currently follows two main approaches: 2D convolutional neural networks (CNNs) and 3D CNNs. In 2D CNNs, temporal and spatial features are extracted independently of each other, so the connection between them is easily overlooked, which hurts recognition performance. 3D CNNs extract the temporal and spatial features of a video sequence jointly, but the number of parameters in the 3D model increases exponentially, making the model difficult to train and transfer. To address this problem, this article improves the existing 3D CNN model by combining it with a residual structure and an attention mechanism, and proposes two types of human action recognition models: the Residual 3D Network (R3D) and the Attention Residual 3D Network (AR3D). First, we propose a shallow feature extraction module and improve the ordinary 3D residual structure, which reduces the number of parameters and strengthens the extraction of temporal features. Second, we explore the application of the attention mechanism to human action recognition and design a 3D spatio-temporal attention module to strengthen the extraction of global features of human action. Finally, to make full use of both the residual structure and the attention mechanism, we propose the AR3D and describe its two fusion strategies and the corresponding model structures (AR3D_V1 and AR3D_V2) in detail. Experiments show that the fused structures improve performance to varying degrees compared with either structure alone.
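The abstract's concern about the parameter cost of 3D CNNs can be illustrated with a quick count for a single layer. The sketch below is not taken from the article: the helper name `conv_params` and the channel and kernel sizes are hypothetical choices for illustration, and it only shows how the weight count of one convolution layer scales when a temporal kernel dimension is added.

```python
def conv_params(in_ch, out_ch, kernel, bias=True):
    """Count learnable parameters in one convolution layer.

    `kernel` is a tuple of kernel sizes, e.g. (3, 3) for a 2D
    convolution or (3, 3, 3) for a 3D convolution.
    (Hypothetical helper, not from the article.)
    """
    weights = in_ch * out_ch
    for k in kernel:
        weights *= k
    return weights + (out_ch if bias else 0)


# Illustrative sizes only: 64 input channels, 64 output channels.
p2d = conv_params(64, 64, (3, 3))     # spatial kernel only
p3d = conv_params(64, 64, (3, 3, 3))  # spatial plus temporal kernel

print(f"2D conv parameters: {p2d}")
print(f"3D conv parameters: {p3d}")
print(f"ratio 3D / 2D: {p3d / p2d:.2f}")
```

Under these assumed sizes, the 3D layer carries roughly three times the weights of the 2D layer, and the gap accumulates across every layer of a deep network, which is one reason parameter reduction in the residual structure matters.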