Wu Kewei, Luo Wenjie, Xie Zhao, Guo Dan, Zhang Zhao, Hong Richang
IEEE Trans Neural Netw Learn Syst. 2025 Mar;36(3):4560-4574. doi: 10.1109/TNNLS.2024.3377468. Epub 2025 Feb 28.
Weakly supervised temporal action localization (TAL) aims to localize action instances in untrimmed videos using only video-level action labels. Without snippet-level labels, it is hard to assign every snippet an accurate action/background category. The main difficulties are the large variations introduced by unconstrained background snippets and by the multiple subactions within action snippets. Existing prototype models describe snippets by covering them with clusters (defined as prototypes). In this work, we argue that clustered prototypes covering snippets with simple variations still misclassify snippets with large variations. We propose an ensemble prototype network (EPNet), which ensembles prototypes learned with consensus-aware clustering. The network stacks a consensus prototype learning (CPL) module and an ensemble snippet weight learning (ESWL) module into one stage and extends one stage to multiple stages in an ensemble learning manner. The CPL module learns a consensus matrix by estimating the similarity of clustering labels between two successive clustering generations. The consensus matrix optimizes the clustering to learn consensus prototypes, which predict snippets with consensus labels. The ESWL module estimates the weights of misclassified snippets using the snippet-level loss. These weights update the posterior probabilities of the snippets in the clustering to learn prototypes in the next stage. We use multiple stages to learn multiple prototypes, which can cover snippets with large variations for accurate snippet classification. Extensive experiments show that our method outperforms state-of-the-art weakly supervised TAL methods on three benchmark datasets, that is, THUMOS'14, ActivityNet v1.2, and ActivityNet v1.3.
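The abstract describes the CPL module's consensus matrix as a similarity of clustering labels between two successive clustering generations. A minimal sketch of one way such a matrix could be computed is shown below, using a co-association form over snippet pairs; the function name and the exact averaging scheme are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def consensus_matrix(labels_prev, labels_curr):
    """Co-association consensus between two clustering generations.

    C[i, j] is the fraction of the two generations in which snippets
    i and j receive the same cluster label (so 1.0, 0.5, or 0.0).
    High-consensus pairs can then guide the next clustering round
    toward consensus prototypes.
    """
    labels_prev = np.asarray(labels_prev)
    labels_curr = np.asarray(labels_curr)
    # Pairwise same-cluster indicators for each generation (broadcasting).
    same_prev = (labels_prev[:, None] == labels_prev[None, :]).astype(float)
    same_curr = (labels_curr[:, None] == labels_curr[None, :]).astype(float)
    # Average agreement across the two successive generations.
    return (same_prev + same_curr) / 2.0
```

For example, with cluster labels `[0, 0, 1]` in one generation and `[0, 1, 1]` in the next, snippets 0 and 1 agree in only one of the two generations, so their consensus entry is 0.5, while each snippet has consensus 1.0 with itself.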