

Semantic and Temporal Contextual Correlation Learning for Weakly-Supervised Temporal Action Localization.

Author Information

Fu Jie, Gao Junyu, Xu Changsheng

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12427-12443. doi: 10.1109/TPAMI.2023.3287208. Epub 2023 Sep 5.

Abstract

Weakly-supervised temporal action localization (WSTAL) aims to automatically identify and localize action instances in untrimmed videos with only video-level labels as supervision. This task poses two challenges: (1) how to accurately discover the action categories present in an untrimmed video (what to discover); and (2) how to precisely cover the complete temporal interval of each action instance (where to focus). Empirically, discovering the action categories requires extracting discriminative semantic information, while robust temporal contextual information benefits complete action localization. However, most existing WSTAL methods fail to explicitly and jointly model the semantic and temporal contextual correlation information needed to address these two challenges. In this article, a Semantic and Temporal Contextual Correlation Learning Network (STCL-Net) with semantic (SCL) and temporal contextual correlation learning (TCL) modules is proposed, which achieves both accurate action discovery and complete action localization by modeling the semantic and temporal contextual correlation information for each snippet in inter- and intra-video manners, respectively. Notably, both proposed modules are designed under a unified dynamic correlation-embedding paradigm. Extensive experiments are performed on multiple benchmarks. On all benchmarks, our proposed method exhibits superior or comparable performance relative to existing state-of-the-art models, achieving gains as high as 7.2% in average mAP on THUMOS-14. In addition, comprehensive ablation studies verify the effectiveness and robustness of each component of our model.
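The abstract gives no implementation details, but the "correlation-embedding" idea it describes, computing snippet-to-snippet correlations and using them to refine each snippet's embedding, can be illustrated with a minimal intra-video (temporal) sketch. The example below is an assumption-laden toy version, not the paper's method: it assumes cosine-similarity correlations and a softmax-weighted residual aggregation over the snippets of one video, and all names, shapes, and hyper-parameters are hypothetical.

```python
# Toy sketch of an intra-video correlation-embedding update (illustrative only).
# X: (T, D) array of per-snippet features from one untrimmed video.
import numpy as np

def correlation_embedding(X: np.ndarray, temperature: float = 0.1) -> np.ndarray:
    """Refine each snippet embedding with correlation-weighted temporal context."""
    # L2-normalize so the dot product gives cosine similarity between snippets.
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    corr = Xn @ Xn.T                      # (T, T) snippet-to-snippet correlation
    np.fill_diagonal(corr, -np.inf)       # exclude self-correlation
    weights = np.exp(corr / temperature)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over context snippets
    context = weights @ X                 # correlation-weighted context features
    return X + context                    # residual embedding update

# Usage: refine T=100 snippets with D=2048-dimensional features.
X = np.random.randn(100, 2048).astype(np.float32)
X_refined = correlation_embedding(X)
print(X_refined.shape)                    # (100, 2048)
```

Per the abstract, the SCL module operates in an inter-video manner; a corresponding sketch would correlate snippets drawn from different videos sharing a video-level label rather than snippets within a single video.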

