Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-Resolution Information in Temporal Domain.

Author Information

Su Rui, Xu Dong, Zhou Luping, Ouyang Wanli

Publication Information

IEEE Trans Image Process. 2021;30:6659-6672. doi: 10.1109/TIP.2021.3089355. Epub 2021 Jul 26.

Abstract

Weakly supervised temporal action localization is a challenging task because only video-level annotations are available during training. To address this problem, we propose a two-stage approach to generate high-quality frame-level pseudo labels by fully exploiting multi-resolution information in the temporal domain and complementary information between the appearance (i.e., RGB) and motion (i.e., optical flow) streams. In the first stage, we propose an Initial Label Generation (ILG) module to generate reliable initial frame-level pseudo labels. Specifically, in this newly proposed module, we exploit temporal multi-resolution consistency and cross-stream consistency to generate high-quality class activation sequences (CASs), which consist of a number of sequences, each measuring how likely each video frame belongs to one specific action class. In the second stage, we propose a Progressive Temporal Label Refinement (PTLR) framework to iteratively refine the pseudo labels, in which we use a set of selected frames with highly confident pseudo labels to progressively train two networks and better predict action class scores at each frame. Specifically, in our newly proposed PTLR framework, two networks called Network-OTS and Network-RTS, which are respectively used to generate CASs for the original temporal scale and the reduced temporal scales, are used as two streams (i.e., the OTS stream and the RTS stream) to refine the pseudo labels in turn. In this way, multi-resolution information in the temporal domain is exchanged at the pseudo-label level, and our work can help improve each network/stream by exploiting the refined pseudo labels from the other network/stream. Comprehensive experiments on two benchmark datasets, THUMOS14 and ActivityNet v1.3, demonstrate the effectiveness of our newly proposed method for weakly supervised temporal action localization.
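The abstract gives no implementation details, so the following is only a minimal NumPy sketch of the two ideas it describes: fusing CASs from several temporal resolutions and both streams into initial frame-level pseudo labels (ILG-style), and alternating pseudo-label refinement between two networks (PTLR-style). All function names (upsample_cas, initial_pseudo_labels, refine_alternating), the nearest-neighbour upsampling, the simple averaging fusion, the 0.7 confidence threshold, and the net.fit/net.predict interface are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def upsample_cas(cas, target_len):
    """Nearest-neighbour upsampling of a (T_r, C) CAS to (target_len, C)."""
    idx = np.linspace(0, cas.shape[0] - 1, target_len).round().astype(int)
    return cas[idx]

def initial_pseudo_labels(cas_list, conf_thresh=0.7):
    """ILG-flavoured fusion (assumed): average CASs computed at several temporal
    resolutions and from both streams, then keep only frames whose fused score
    is confident enough to serve as pseudo labels."""
    T = max(c.shape[0] for c in cas_list)           # original temporal scale
    fused = np.mean([upsample_cas(c, T) for c in cas_list], axis=0)
    labels = fused.argmax(axis=1)                   # per-frame action class
    confident = fused.max(axis=1) > conf_thresh     # high-confidence frame mask
    return labels, confident

def refine_alternating(net_ots, net_rts, features, labels, confident, rounds=3):
    """PTLR-flavoured alternation (assumed interface): each round, the OTS and
    RTS streams take turns being trained on the currently confident frames,
    and the freshly predicted CAS refreshes the pseudo labels used next."""
    for _ in range(rounds):
        for net in (net_ots, net_rts):
            net.fit(features[confident], labels[confident])  # hypothetical training call
            cas = net.predict(features)                      # (T, C) class activation sequence
            labels = cas.argmax(axis=1)
            confident = cas.max(axis=1) > 0.7                # re-select confident frames
    return labels, confident
```

The sketch only illustrates how multi-resolution information could be exchanged at the pseudo-label level; in the paper the frame selection and refinement are driven by the trained CAS networks rather than a fixed threshold.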
