College of Information Sciences and Technology, Northeast Normal University, Changchun 130117, China.
Institute for Intelligent Elderly Care, College of Humanities & Sciences of Northeast Normal University, Changchun 130117, China.
Sensors (Basel). 2020 Jun 1;20(11):3126. doi: 10.3390/s20113126.
Action recognition is an important and challenging topic in sensing and computer vision. Two-stream convolutional neural networks (CNNs) and 3D CNNs are the two mainstream deep learning architectures for video action recognition. To combine them in a single framework and further improve performance, we propose a novel deep network, the spatiotemporal interaction residual network with pseudo3D (STINP). The STINP has three advantages. First, it consists of two branches built on residual networks (ResNets) that learn the spatial and temporal information of the video simultaneously. Second, it integrates the pseudo3D block into the residual units of the spatial branch, so that this branch learns not only the appearance features of the objects and scene in the video but also the potential interaction information among consecutive frames. Finally, it fuses the spatial and temporal branches with a simple but effective multiplication operation, which allows the learned spatial and temporal representations to interact with each other throughout training. Experiments were conducted on two classic action recognition datasets, UCF101 and HMDB51. The results show that the proposed STINP achieves better video recognition performance than other state-of-the-art algorithms.
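The two mechanisms named in the abstract, the pseudo3D block inside a residual unit and the multiplicative fusion of the two branches, can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names, the kernel values, the serial ordering of the spatial and temporal convolutions, and the plain element-wise product used for fusion are all assumptions chosen for illustration, and a real network would use learned multi-channel filters in a deep learning framework.

```python
# Illustrative sketch only. All names, kernels, and the exact wiring are
# assumptions; the abstract does not specify these details.
# A "clip" is a nested list indexed as clip[t][y][x] (time, height, width).

def conv_spatial(clip, kernel):
    """Apply one 3x3 kernel to every frame independently (a 1x3x3 conv, zero-padded)."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                s = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < H and 0 <= xx < W:
                            s += kernel[dy + 1][dx + 1] * clip[t][yy][xx]
                out[t][y][x] = s
    return out

def conv_temporal(clip, kernel):
    """Apply one length-3 kernel along the time axis (a 3x1x1 conv, zero-padded)."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                s = 0.0
                for dt in (-1, 0, 1):
                    tt = t + dt
                    if 0 <= tt < T:
                        s += kernel[dt + 1] * clip[tt][y][x]
                out[t][y][x] = s
    return out

def relu(clip):
    """Element-wise max(0, v)."""
    return [[[max(0.0, v) for v in row] for row in frame] for frame in clip]

def elementwise(fn, a, b):
    """Combine two clips of identical shape value by value."""
    return [[[fn(p, q) for p, q in zip(ra, rb)]
             for ra, rb in zip(fa, fb)]
            for fa, fb in zip(a, b)]

def pseudo3d_residual_unit(clip, k_spatial, k_temporal):
    """Residual unit with a factorized 3D conv: x + temporal(relu(spatial(x))).

    The serial spatial-then-temporal order is one possible arrangement (assumed here).
    """
    h = conv_temporal(relu(conv_spatial(clip, k_spatial)), k_temporal)
    return elementwise(lambda p, q: p + q, clip, h)

def fuse_multiply(spatial_feat, temporal_feat):
    """Multiplicative fusion: the temporal features scale the spatial ones."""
    return elementwise(lambda p, q: p * q, spatial_feat, temporal_feat)

# Toy input: 3 frames of 3x3 pixels, frame t filled with the value t + 1.
clip = [[[float(t + 1)] * 3 for _ in range(3)] for t in range(3)]
identity_2d = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
identity_1d = [0, 1, 0]

appearance = pseudo3d_residual_unit(clip, identity_2d, identity_1d)
motion = [[[0.5] * 3 for _ in range(3)] for _ in range(3)]  # stand-in temporal features
fused = fuse_multiply(appearance, motion)
print(fused[0][0][0], fused[2][0][0])
```

Factorizing the 3D convolution into a per-frame spatial filter followed by a filter along time is what lets the spatial branch see neighbouring frames at a lower cost than a full 3x3x3 kernel, and the element-wise product means gradients in one branch depend on the activations of the other, which is the sense in which the two representations interact during training.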