Tao Li, Wang Xueting, Yamasaki Toshihiko
IEEE Trans Image Process. 2021;30:9231-9244. doi: 10.1109/TIP.2021.3124156. Epub 2021 Nov 10.
Recently, 3D convolutional networks have yielded good performance in action recognition. However, an optical flow stream is still needed for motion representation to ensure better performance, and computing optical flow is very expensive. In this paper, we propose a cheap but effective way to extract motion features from videos by using residual frames as the input data to 3D ConvNets. By replacing traditional stacked RGB frames with residual ones, improvements of 35.6 and 26.6 percentage points in top-1 accuracy can be achieved on the UCF101 and HMDB51 datasets when training ResNet-18-3D from scratch. We analyze the effectiveness of this modality in depth compared with normal RGB video clips, and find that 3D ConvNets extract better motion features from residual frames. Because residual frames contain little object-appearance information, we further use a 2D convolutional network to extract appearance features and combine the two to form a two-path solution. In this way, we achieve better performance than some methods that even use an additional optical flow stream. Moreover, the proposed residual-input path outperforms its RGB counterpart on unseen datasets when the trained models are applied to video retrieval tasks. Large improvements are also obtained when residual inputs are applied to video-based self-supervised learning methods, revealing the better motion representation and generalization ability of our proposal.
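The residual-frame input described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a video clip stored as a NumPy array of shape (T, H, W, C) and forms each residual frame as the difference between adjacent RGB frames, which suppresses static appearance and keeps motion.

```python
import numpy as np

def residual_frames(clip):
    """Compute residual frames from a stacked RGB clip.

    clip: array of shape (T, H, W, C) holding T consecutive frames.
    Returns an array of shape (T-1, H, W, C) where each output frame
    is the difference between adjacent input frames, capturing motion
    while discarding most static appearance information.
    """
    clip = clip.astype(np.float32)
    return clip[1:] - clip[:-1]

# Toy example: 4 "frames" of a 2x2 single-channel clip whose
# pixel values grow by 4 from one frame to the next.
clip = np.arange(16, dtype=np.float32).reshape(4, 2, 2, 1)
res = residual_frames(clip)
print(res.shape)  # (3, 2, 2, 1)
```

The resulting (T-1)-frame stack would then replace the stacked RGB frames as the input to a 3D ConvNet such as ResNet-18-3D.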