Department of Computer Science, Sichuan University, Chengdu 610017, China.
Sensors (Basel). 2022 Sep 1;22(17):6595. doi: 10.3390/s22176595.
Action recognition, a sub-field of video content analysis that aims to recognize human actions in videos, has received extensive attention in recent years. Compared with a single image, a video has an additional temporal dimension; extracting spatio-temporal information from videos is therefore essential for action recognition. In this paper, an efficient network, dubbed MEST, is proposed to extract spatio-temporal information at a relatively low computational cost. First, a motion encoder is developed to capture short-term motion cues between consecutive frames, followed by a channel-wise spatio-temporal module that models long-term feature information. Moreover, weight standardization is applied to the convolution layers that precede batch normalization layers, which expedites training and facilitates convergence. Experiments are conducted on five public action recognition datasets, Something-Something-V1 and -V2, Jester, UCF101 and HMDB51, where MEST exhibits competitive performance compared with other popular methods. The results demonstrate the effectiveness of our network in terms of accuracy, computational cost and network scale.
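The abstract pairs weight standardization with batch normalization to speed up convergence. The sketch below is a minimal illustration of that standard combination (not the authors' released code): each convolutional filter's weights are normalized to zero mean and unit variance before the convolution, and the output is then batch-normalized. The class name WSConv2d, the epsilon value and the tensor sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d whose weights are standardized (zero mean, unit variance
    per output filter) before every forward pass."""
    def forward(self, x):
        w = self.weight
        # Standardize over the (in_channels, kH, kW) dimensions of each filter.
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5  # eps is an assumption
        w_hat = (w - mean) / std
        return F.conv2d(x, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Typical usage: a standardized convolution directly followed by batch
# normalization, the combination the abstract credits with faster convergence.
block = nn.Sequential(
    WSConv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
x = torch.randn(2, 64, 56, 56)  # (batch, channels, height, width)
y = block(x)
```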