Liu Chang, Yao Yuan, Luo Dezhao, Zhou Yu, Ye Qixiang
IEEE Trans Neural Netw Learn Syst. 2023 Dec;34(12):9832-9846. doi: 10.1109/TNNLS.2022.3160860. Epub 2023 Nov 30.
In this study, we propose a novel pretext task and a self-supervised motion perception (SMP) method for spatiotemporal representation learning. The pretext task is defined as video playback rate perception, which uses temporal dilated sampling to augment video clips into multiple duplicates of different temporal resolutions. The SMP method is built upon discriminative and generative motion perception models, which capture representations related to motion dynamics and appearance from video clips of multiple temporal resolutions in a collaborative fashion. To enhance this collaboration, we further propose difference and convolution motion attention (MA), which drives the generative model to focus on motion-related appearance, and leverage multiple granularity perception (MG) to extract accurate motion dynamics. Extensive experiments demonstrate SMP's effectiveness for video motion perception and the state-of-the-art performance of the resulting self-supervised representations on downstream tasks, including action recognition and video retrieval. Code for SMP is available at github.com/yuanyao366/SMP.
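The core data augmentation behind the pretext task, temporal dilated sampling, can be sketched as follows. This is a hypothetical minimal illustration, not the authors' implementation: the function name, the set of dilation rates, and the random start policy are assumptions for the sketch. A rate-r clip keeps every r-th frame, so clips with larger rates span a longer time window with the same frame count, mimicking a faster playback rate; the rate index serves as the classification label for playback rate perception.

```python
import numpy as np

def temporal_dilated_sampling(video, clip_len=16, rates=(1, 2, 4, 8)):
    """Sample one fixed-length clip per dilation rate (illustrative sketch).

    video    : array of frames, shape (T, H, W) or (T, H, W, C)
    clip_len : number of frames in every output clip
    rates    : temporal dilation rates; rate r keeps every r-th frame

    Returns (clips, labels), where labels index the sampled rate and
    act as the target for the discriminative playback-rate task.
    """
    clips, labels = [], []
    for label, r in enumerate(rates):
        span = clip_len * r                     # source frames this clip covers
        start = np.random.randint(0, len(video) - span + 1)
        clip = video[start:start + span:r]      # dilated (strided) sampling
        clips.append(clip)
        labels.append(label)
    return np.stack(clips), np.array(labels)

# Toy "video": 128 frames of 8x8 single-channel pixels.
video = np.random.rand(128, 8, 8)
clips, labels = temporal_dilated_sampling(video)
print(clips.shape)   # (4, 16, 8, 8): four clips of 16 frames each
print(labels)        # [0 1 2 3]
```

All four clips share the same frame count, so a single backbone can process them while the network must infer the underlying temporal resolution from motion cues alone.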