IEEE Trans Image Process. 2023;32:2215-2227. doi: 10.1109/TIP.2023.3265261. Epub 2023 Apr 18.
Semi-supervised learning is well established in image classification but remains underexplored in video-based action recognition. FixMatch is a state-of-the-art semi-supervised method for image classification, but it does not work well when transferred directly to the video domain because it uses only the RGB modality, which carries insufficient motion information. Moreover, it relies only on high-confidence pseudo-labels to enforce consistency between strongly augmented and weakly augmented samples, which results in limited supervisory signals, long training time, and insufficient feature discriminability. To address these issues, we propose neighbor-guided consistent and contrastive learning (NCCL), which takes both RGB and temporal gradient (TG) as input and is built on the teacher-student framework. Because labeled samples are scarce, we first incorporate neighbor information as a self-supervised signal to enforce consistency, which compensates for FixMatch's lack of supervisory signals and its long training time. To learn more discriminative feature representations, we further propose a novel neighbor-guided category-level contrastive learning term that minimizes the intra-class distance and enlarges the inter-class distance. We conduct extensive experiments on four datasets to validate the effectiveness of the method. Compared with state-of-the-art methods, the proposed NCCL achieves superior performance at much lower computational cost.
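As a rough illustration of two ingredients mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that the temporal gradient is the difference between consecutive RGB frames (the abstract does not define it) and that FixMatch-style pseudo-labeling keeps only predictions whose maximum class probability reaches a threshold. The function names, the array layout (T, H, W, C), and the threshold value 0.95 are illustrative assumptions.

```python
import numpy as np


def temporal_gradient(clip):
    """Frame-to-frame difference of a clip shaped (T, H, W, C).

    Assumption: TG is taken as clip[t + 1] - clip[t]; the abstract
    does not give the exact definition.
    """
    # Cast first so that uint8 frames do not wrap around on subtraction.
    clip = clip.astype(np.float32)
    return clip[1:] - clip[:-1]


def confident_pseudo_labels(weak_probs, threshold=0.95):
    """FixMatch-style selection on weak-augmentation predictions.

    weak_probs: array shaped (N, num_classes), rows are probabilities.
    Returns the argmax label of every sample and a boolean mask that
    marks the samples confident enough to be used as pseudo-labels.
    The threshold value is an illustrative assumption.
    """
    confidence = weak_probs.max(axis=1)
    labels = weak_probs.argmax(axis=1)
    mask = confidence >= threshold
    return labels, mask


# Hypothetical toy inputs: a 3-frame clip and two softmax outputs.
clip = np.zeros((3, 2, 2, 3), dtype=np.uint8)
clip[1] = 10
clip[2] = 5
tg = temporal_gradient(clip)

probs = np.array([[0.97, 0.02, 0.01],
                  [0.50, 0.30, 0.20]])
labels, mask = confident_pseudo_labels(probs)
```

In this toy run only the first sample passes the confidence test. The second sample contributes no pseudo-label, which is the kind of discarded supervision the abstract says neighbor information is meant to compensate for.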