Cao Haozhi, Xu Yuecong, Mao Kezhi, Xie Lihua, Yin Jianxiong, See Simon, Xu Qianwen, Yang Jianfei
IEEE Trans Cybern. 2024 Jun;54(6):3810-3822. doi: 10.1109/TCYB.2023.3265393. Epub 2024 May 30.
This article introduces a novel self-supervised method that leverages incoherence detection for video representation learning. It stems from the observation that the human visual system can easily identify video incoherence based on its comprehensive understanding of videos. Specifically, we construct incoherent clips from multiple subclips hierarchically sampled from the same raw video, with varying lengths of incoherence. Given an incoherent clip as input, the network learns high-level representations by predicting the location and length of the incoherence. In addition, we introduce intra-video contrastive learning to maximize the mutual information between incoherent clips sampled from the same raw video. We evaluate the proposed method through extensive experiments on action recognition and video retrieval with various backbone networks. Experiments show that it achieves remarkable performance across different backbone networks and datasets compared with previous coherence-based methods.
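To make the two pretext tasks in the abstract concrete, below is a minimal PyTorch sketch of (1) constructing an incoherent clip by splicing subclips from the same raw video with a random temporal jump, and (2) an InfoNCE-style intra-video contrastive loss that treats incoherent clips from the same video as positives. All names here (`make_incoherent_clip`, `IncoherenceHead`, `intra_video_nce`) and the specific sampling scheme are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the paper's two pretext ideas, assuming PyTorch and
# a generic 3-D CNN backbone producing a (B, D) feature vector per clip.
import torch
import torch.nn.functional as F

def make_incoherent_clip(video, clip_len=16, max_gap=8):
    """Splice two subclips from the same raw video so the result contains a
    temporal jump ("incoherence") of random location and length.

    video: (C, T, H, W) tensor of a decoded raw video, with T >= clip_len + max_gap + 1.
    Returns the spliced clip plus the location and length of the incoherence,
    which serve as the self-supervised prediction targets.
    """
    C, T, H, W = video.shape
    gap = int(torch.randint(1, max_gap + 1, ()))       # incoherence length
    loc = int(torch.randint(1, clip_len - 1, ()))      # incoherence location
    start = int(torch.randint(0, T - (clip_len + gap), ()))
    first = video[:, start:start + loc]                # frames before the jump
    second_start = start + loc + gap                   # skip `gap` frames
    second = video[:, second_start:second_start + (clip_len - loc)]
    clip = torch.cat([first, second], dim=1)           # (C, clip_len, H, W)
    return clip, loc, gap

class IncoherenceHead(torch.nn.Module):
    """Classify the location and length of incoherence from backbone features
    (targets would be loc - 1 and gap - 1 as class indices)."""
    def __init__(self, feat_dim, clip_len=16, max_gap=8):
        super().__init__()
        self.loc_fc = torch.nn.Linear(feat_dim, clip_len - 1)  # location logits
        self.len_fc = torch.nn.Linear(feat_dim, max_gap)       # length logits

    def forward(self, feat):
        return self.loc_fc(feat), self.len_fc(feat)

def intra_video_nce(z1, z2, temperature=0.1):
    """InfoNCE-style loss: embeddings of two incoherent clips cut from the
    same raw video are positives; clips from other videos in the batch act as
    negatives. Minimizing this maximizes a lower bound on the mutual
    information between the positive pair.
    z1, z2: (B, D) clip embeddings.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```

A training step under these assumptions would sample two incoherent clips per raw video, apply cross-entropy losses on the location/length predictions from `IncoherenceHead`, and add `intra_video_nce` on the paired embeddings.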