
Two-Branch Relational Prototypical Network for Weakly Supervised Temporal Action Localization.

Authors

Huang Linjiang, Huang Yan, Ouyang Wanli, Wang Liang

Publication

IEEE Trans Pattern Anal Mach Intell. 2022 Sep;44(9):5729-5746. doi: 10.1109/TPAMI.2021.3076172. Epub 2022 Aug 4.

Abstract

As a challenging task in high-level video understanding, weakly supervised temporal action localization has attracted increasing attention recently. Given only video-level category labels, this task must identify both the background and the action categories frame by frame. However, achieving this in untrimmed videos is non-trivial, owing to the unconstrained background and the complex, multi-label actions. Observing that these difficulties are mainly caused by the large variations within the background and the actions, we propose to address these challenges from the perspective of modeling variations. Moreover, it is desirable to further reduce the variations, i.e., to learn compact features, so as to cast the problem of background identification as rejecting background and to alleviate the contradiction between classification and detection. Accordingly, in this paper, we propose a two-branch relational prototypical network. The first branch, namely the action-branch, adopts class-wise prototypes and mainly acts as an auxiliary that introduces prior knowledge about label dependencies and serves as a guide for the second branch. Meanwhile, the second branch, namely the sub-branch, starts with multiple prototypes per class, namely sub-prototypes, which enable a powerful ability to model variations. As a further benefit, we elaborately design a multi-label clustering loss based on the sub-prototypes to learn compact features under the multi-label setting. The two branches are associated through the correspondences between the two types of prototypes, leading to a special two-stage classifier in the sub-branch; on the other hand, the two branches serve as regularization terms for each other, improving the final performance. Ablation studies show that the proposed model is capable of modeling classes with large variations and of learning compact features.
Extensive experimental evaluations on Thumos14, MultiThumos and ActivityNet datasets demonstrate the effectiveness of the proposed method and superior performance over state-of-the-art approaches.
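The core mechanism the abstract describes, scoring each frame against per-class prototypes while keeping several sub-prototypes per class to absorb intra-class variation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the Euclidean metric, the softmax temperature, and the toy clustering loss below are all illustrative assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def frame_scores(feature, sub_prototypes, temperature=1.0):
    """Score one frame feature against each class.

    sub_prototypes: {class_name: [prototype_vector, ...]}, i.e. several
    sub-prototypes per class to model within-class variation. The class
    distance is the distance to the nearest sub-prototype of that class.
    """
    dists = {c: min(euclidean(feature, p) for p in protos)
             for c, protos in sub_prototypes.items()}
    # Softmax over negative distances: a closer prototype yields a higher score.
    exps = {c: math.exp(-d / temperature) for c, d in dists.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def clustering_loss(features, sub_prototypes, labels):
    """Toy compactness loss: pull each frame feature toward the nearest
    sub-prototype among the video-level labels (multi-label setting)."""
    total = 0.0
    for f in features:
        total += min(euclidean(f, p)
                     for c in labels for p in sub_prototypes[c])
    return total / len(features)

# Two hypothetical sub-prototypes for an action class, one for background.
protos = {
    "action_A": [[0.0, 0.0], [0.1, 0.2]],
    "background": [[1.0, 1.0]],
}
scores = frame_scores([0.05, 0.1], protos)
print(max(scores, key=scores.get))  # this frame lies near an action_A sub-prototype
```

The sketch shows why multiple sub-prototypes help: a class with large variation can cover several feature-space modes, while the min-distance rule still produces a single per-class score, and minimizing the clustering loss drives features toward their nearest sub-prototype, yielding the compact features the abstract argues for.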

