
Semi-supervised action recognition using logit aligned consistency and adaptive negative learning.

Authors

Zuo Fengyun, Xu Yang, Wang Minggang

Affiliations

College of Big Data and Information Engineering, Guizhou University, Guiyang, 550025, China.

Zunyi Aluminum Stock Corporation Ltd, Zunyi, 563100, China.

Publication

Sci Rep. 2025 May 30;15(1):19064. doi: 10.1038/s41598-025-01922-2.

Abstract

In the era of large-scale social video, semi-supervised action recognition can address the rising cost of video annotation, yet it still faces significant challenges, particularly the underexplored use of Vision Transformers. In this paper, we present Full-SVFormer, a simple yet efficient semi-supervised action recognition architecture built on the Transformer framework. Full-SVFormer uses a TimeSformer initialized with pre-trained weights as its backbone, balancing the accuracy and speed of Transformers for semi-supervised action recognition. Within the stable pseudo-label framework EMA-Teacher, we introduce a KL divergence loss computed on logit-standardized predictions as the unsupervised consistency loss, which sharpens the student's focus on the intrinsic relationship between the student's and teacher's logits. Furthermore, we incorporate Adaptive Negative Learning (ANL) to introduce additional negative pseudo-labels: the method dynamically evaluates the model's Top-k performance to adaptively assign negative labels, making better use of ambiguous predictions. Experiments on two widely used datasets, UCF-101 and HMDB-51, show that our method outperforms previous approaches. This work further advances the development of Transformers in semi-supervised action recognition.
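The abstract names two ingredients of the consistency branch, an EMA teacher and a KL divergence loss on logit-standardized predictions, without giving formulas. Below is a minimal PyTorch-style sketch of how these pieces are commonly wired together; the z-score standardization, the temperature, the momentum value, and the function names are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code) of an EMA-Teacher update and a
# KL consistency loss computed on logit-standardized predictions.
# The z-score standardization, temperature, and momentum value below are
# illustrative assumptions.
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.999) -> None:
    """Teacher weights follow an exponential moving average of the student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)


def standardize_logits(logits: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Z-score normalize each logit vector (zero mean, unit variance)."""
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    return (logits - mean) / (std + eps)


def logit_aligned_kl_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) on standardized logits, used as the
    unsupervised consistency term on unlabeled clips."""
    log_p_s = F.log_softmax(standardize_logits(student_logits) / temperature, dim=-1)
    p_t = F.softmax(standardize_logits(teacher_logits) / temperature, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean")

In such a setup the EMA update would run once per training step after the student's optimizer step, so the teacher's pseudo-labels drift slowly and remain stable.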

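The abstract also states that ANL assigns negative pseudo-labels by dynamically evaluating the model's Top-k performance. The sketch below gives one plausible reading under stated assumptions: k is picked as the smallest value whose Top-k accuracy on labeled data clears a threshold, and classes the teacher ranks outside that Top-k are suppressed with a negative-learning term. The selection rule, threshold, and function names are assumptions, not the paper's exact procedure.

# Hedged sketch of Adaptive Negative Learning (ANL): classes ranked outside
# the teacher's current Top-k are treated as negative pseudo-labels and
# suppressed with a negative-learning term. The rule for choosing k
# (smallest k whose Top-k accuracy on labeled data clears a threshold) is
# an assumption, not the paper's exact procedure.
import torch
import torch.nn.functional as F


def choose_k(topk_accuracies: list, threshold: float = 0.95) -> int:
    """topk_accuracies[i] is the Top-(i+1) accuracy on held-out labeled data."""
    for i, acc in enumerate(topk_accuracies):
        if acc >= threshold:
            return i + 1
    return len(topk_accuracies)


def adaptive_negative_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           k: int, eps: float = 1e-7) -> torch.Tensor:
    """Push down the student's probability for classes the teacher ranks
    below its Top-k (negative pseudo-labels)."""
    probs = F.softmax(student_logits, dim=-1)
    topk_idx = teacher_logits.topk(k, dim=-1).indices
    neg_mask = torch.ones_like(probs)
    neg_mask.scatter_(-1, topk_idx, 0.0)  # zero out the teacher's Top-k classes
    loss = -(neg_mask * torch.log(1.0 - probs + eps)).sum(dim=-1)
    return loss.mean()

In a full training step this term would typically be combined with the supervised cross-entropy on labeled clips and the consistency loss sketched above.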

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6e40/12125368/f6cc8fb9926d/41598_2025_1922_Fig1_HTML.jpg
