Luo Wang, Ren Huan, Zhang Tianzhu, Yang Wenfei, Zhang Yongdong
IEEE Trans Image Process. 2025;34:3154-3168. doi: 10.1109/TIP.2024.3431915.
Weakly-supervised Temporal Action Localization (WTAL) aims to localize action instances using only video-level labels during training; its two primary issues are localization incompleteness and background interference. To relieve these issues, recent methods adopt an attention mechanism to activate action instances while simultaneously suppressing background ones, achieving remarkable progress. Nevertheless, we argue that these two issues have not yet been well resolved. On the one hand, the attention mechanism uses fixed weights across different videos, which cannot handle the diversity among videos and is therefore deficient in addressing localization incompleteness. On the other hand, previous methods focus only on learning foreground attention, and the attention weights often suffer from ambiguity, making it difficult to suppress background interference. To address these issues, in this paper we propose an Adaptive Prototype Learning (APL) method for WTAL, which includes two key designs: 1) an Adaptive Transformer Network (ATN) that explicitly models the background and learns video-adaptive prototypes for each specific video; and 2) an OT-based Collaborative (OTC) training strategy that guides prototype learning and removes the ambiguity of foreground-background separation by introducing an Optimal Transport (OT) algorithm into the collaborative training scheme between the RGB and FLOW streams. These two designs work together to learn video-adaptive prototypes and resolve both issues, achieving robust localization. Extensive experimental results on two standard benchmarks (THUMOS14 and ActivityNet) demonstrate that our proposed APL performs favorably against state-of-the-art methods.
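The OTC strategy rests on an Optimal Transport algorithm for assigning video segments to prototypes. A minimal entropic-regularized (Sinkhorn-style) OT sketch is shown below; it is illustrative only — the function name, the cost construction from affinity scores, and the uniform marginals are assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

def sinkhorn(cost, r, c, eps=0.1, n_iters=200):
    """Entropic-regularized OT via Sinkhorn iterations.

    Returns a transport plan P with row marginals r and column
    marginals c that (approximately) minimizes <P, cost> plus an
    entropy regularizer with strength eps.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel from the cost matrix
    u = np.ones_like(r)
    for _ in range(n_iters):
        v = c / (K.T @ u)            # column-scaling update
        u = r / (K @ v)              # row-scaling update
    return u[:, None] * K * v[None, :]

# Toy example (hypothetical numbers): soft-assign 4 video segments
# to 2 prototypes (e.g. foreground / background).
rng = np.random.default_rng(0)
affinity = rng.random((4, 2))        # stand-in for attention scores
cost = 1.0 - affinity                # lower cost = higher affinity
r = np.full(4, 1 / 4)                # uniform mass over segments
c = np.full(2, 1 / 2)                # balanced prototype mass
P = sinkhorn(cost, r, c)
```

The marginal constraints are what remove assignment ambiguity here: unlike thresholding raw attention, the plan `P` is forced to distribute segment mass across prototypes according to the prescribed budgets `r` and `c`.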