Liu Ziyi, Wang Le, Zhang Qilin, Tang Wei, Zheng Nanning, Hua Gang
IEEE Trans Pattern Anal Mach Intell. 2022 Sep;44(9):5886-5902. doi: 10.1109/TPAMI.2021.3078798. Epub 2022 Aug 4.
Given only video-level action category labels during training, weakly-supervised temporal action localization (WS-TAL) learns to detect action instances and locate their temporal boundaries in untrimmed videos. Compared to its fully supervised counterpart, WS-TAL requires far less labeling effort and is therefore attractive in practical applications. However, the coarse video-level supervision inevitably introduces ambiguity in action localization, especially in untrimmed videos containing multiple action instances. To overcome this challenge, we observe that significant temporal contrasts among video snippets, e.g., those caused by temporal discontinuities and sudden changes, often occur around true action boundaries. This motivates us to introduce a Contrast-based Localization EvaluAtioN Network (CleanNet), whose core is a new temporal action proposal evaluator that provides fine-grained pseudo supervision by leveraging the temporal contrasts among snippet-level classification predictions. As a result, the uncertainty in locating action instances can be resolved by evaluating their temporal contrast scores. Moreover, the new action localization module is an integral part of CleanNet, which enables end-to-end training. This is in contrast to many existing WS-TAL methods, where action localization is merely a post-processing step. We also explore the use of temporal contrast for the temporal action proposal (TAP) generation task, which we believe is the first such attempt under the weakly supervised setting. Experiments on the THUMOS14, ActivityNet v1.2, and ActivityNet v1.3 datasets validate the efficacy of our method against existing state-of-the-art WS-TAL algorithms.
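The abstract describes scoring action proposals by the temporal contrast among snippet-level classification predictions, but it does not give the formula. The following is a minimal illustrative sketch, not CleanNet's actual evaluator: it assumes a contrast score defined as the mean prediction inside a proposal minus the mean prediction in the margins flanking it. The function names `contrast_score` and `best_proposal`, the `margin_ratio` parameter, and the toy scores are all hypothetical.

```python
def contrast_score(snippet_scores, start, end, margin_ratio=0.25):
    """Illustrative contrast score for the proposal [start, end).

    Assumed definition (not taken from the paper): mean snippet score
    inside the proposal minus mean snippet score in the margins that
    flank it on the left and right.
    """
    scores = list(snippet_scores)
    if not (0 <= start < end <= len(scores)):
        raise ValueError("proposal must satisfy 0 <= start < end <= len(scores)")

    # Margin width grows with proposal length, with a floor of one snippet.
    margin = max(1, round(margin_ratio * (end - start)))

    inner = scores[start:end]
    outer = scores[max(0, start - margin):start] + scores[end:end + margin]

    inner_mean = sum(inner) / len(inner)
    # If the proposal spans the whole video there is no margin to compare with.
    outer_mean = sum(outer) / len(outer) if outer else 0.0
    return inner_mean - outer_mean


def best_proposal(snippet_scores, proposals, margin_ratio=0.25):
    """Return the (start, end) proposal with the highest contrast score."""
    return max(
        proposals,
        key=lambda p: contrast_score(snippet_scores, p[0], p[1], margin_ratio),
    )


# Toy per-snippet class scores: the action occupies snippets 2..5.
toy_scores = [0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
candidates = [(1, 7), (2, 6), (3, 5)]
selected = best_proposal(toy_scores, candidates)
```

On the toy scores, the proposal aligned with the true boundaries, `(2, 6)`, receives the highest score. The over-long proposal `(1, 7)` dilutes its inner mean, and the over-short proposal `(3, 5)` has high-scoring margins, so both are penalized. This is the intuition behind using contrast to resolve boundary ambiguity.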