Wang Xiao, Yan Yan, Hu Hai-Miao, Li Bo, Wang Hanzi
IEEE Trans Image Process. 2024;33:1257-1271. doi: 10.1109/TIP.2024.3354104. Epub 2024 Feb 13.
Few-shot action recognition aims to recognize new, unseen categories from only a few labeled samples per class. However, it still suffers from inadequate data, which easily leads to overfitting and poor generalization. We therefore propose a cross-modal contrastive learning network (CCLN), consisting of an adversarial branch and a contrastive branch, for effective few-shot action recognition. In the adversarial branch, we design a prototypical generative adversarial network (PGAN) that synthesizes additional training samples, mitigating data scarcity and thereby alleviating overfitting. When training samples are limited, the learned visual features are usually suboptimal for video understanding because they lack discriminative information. To address this issue, in the contrastive branch we propose a cross-modal contrastive learning module (CCLM) that uses semantic information to obtain discriminative feature representations, strengthening the network's class-level feature learning. Moreover, since videos carry crucial sequence and ordering information, we introduce a spatial-temporal enhancement module (SEM) to model the spatial context within video frames and the temporal context across video frames. Experimental results show that the proposed CCLN outperforms state-of-the-art few-shot action recognition methods on four challenging benchmarks: Kinetics, UCF101, HMDB51 and SSv2.
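The abstract does not detail the episodic machinery behind the prototype-conditioned adversarial branch, so the following is a minimal sketch of the standard prototypical few-shot setup such a branch would build on: support features are averaged into one prototype per class and queries are assigned to the nearest prototype. All tensor shapes and function names here are illustrative assumptions, not the paper's PGAN implementation.

```python
import torch

def class_prototypes(support_feats: torch.Tensor,
                     support_labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """Average the support features of each class into one prototype.

    support_feats:  (N, D) features of the N labeled support clips
    support_labels: (N,)   class index in [0, num_classes) for each clip
    returns:        (C, D) one prototype per class
    """
    return torch.stack([
        support_feats[support_labels == c].mean(dim=0)
        for c in range(num_classes)
    ])

def classify_queries(query_feats: torch.Tensor,
                     protos: torch.Tensor) -> torch.Tensor:
    """Assign each query clip to its nearest prototype (Euclidean distance)."""
    dists = torch.cdist(query_feats, protos)  # (Q, C) pairwise distances
    return dists.argmin(dim=1)
```

For a 5-way 5-shot episode with 512-d clip features, `class_prototypes` would take a (25, 512) support tensor and labels `torch.arange(5).repeat_interleave(5)`; the PGAN described in the abstract would additionally synthesize samples around such prototypes to enlarge the support set.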
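Likewise, the cross-modal contrastive objective in the CCLM is only characterized at a high level. A minimal sketch of one common form of such an objective is given below: video features are pulled toward the semantic (label-text) embedding of their own class and pushed away from other classes' embeddings. The InfoNCE formulation and the temperature value are assumptions for illustration; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(visual: torch.Tensor,
                                 semantic: torch.Tensor,
                                 labels: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over visual-to-semantic similarities.

    visual:   (B, D) per-clip visual features
    semantic: (C, D) per-class semantic embeddings
    labels:   (B,)   class index of each clip
    """
    visual = F.normalize(visual, dim=-1)
    semantic = F.normalize(semantic, dim=-1)
    logits = visual @ semantic.t() / temperature  # (B, C) scaled cosine similarities
    # The positive for each clip is its own class's semantic embedding;
    # every other class acts as a negative.
    return F.cross_entropy(logits, labels)
```

Treating the class's semantic embedding as the single positive makes the contrast operate at the class level rather than the instance level, which matches the abstract's claim that the CCLM enhances class-level feature learning.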