Weakly-Supervised Video Object Grounding via Causal Intervention.

Author information

Wang Wei, Gao Junyu, Xu Changsheng

Publication information

IEEE Trans Pattern Anal Mach Intell. 2023 Mar;45(3):3933-3948. doi: 10.1109/TPAMI.2022.3180025. Epub 2023 Feb 3.

DOI:10.1109/TPAMI.2022.3180025

Abstract

We target the task of weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are available during model learning. The task aims to localize objects described in the sentence to visual regions in the video, a fundamental capability needed in pattern analysis and machine learning. Despite recent progress, existing methods all suffer from the severe problem of spurious association, which harms grounding performance. In this paper, we start from the definition of WSVOG and pinpoint the spurious association from two aspects: (1) the association itself is not object-relevant but extremely ambiguous due to weak supervision; and (2) the association is unavoidably confounded by observational bias when the statistics-based matching strategy of existing methods is adopted. With this in mind, we design a unified causal framework to learn the deconfounded object-relevant association for more accurate and robust video object grounding. Specifically, we learn the object-relevant association by causal intervention from the perspective of the video data generation process. To overcome the lack of fine-grained supervision for the intervention, we propose a novel spatial-temporal adversarial contrastive learning paradigm. To further remove the accompanying confounding effect within the object-relevant association, we pursue the true causality by conducting causal intervention via backdoor adjustment. Finally, the deconfounded object-relevant association is learned and optimized under a unified causal framework in an end-to-end manner. Extensive experiments on both IID and OOD testing sets of three benchmarks demonstrate accurate and robust grounding performance against state-of-the-art methods.
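The abstract names backdoor adjustment without giving details, so the following is only a minimal sketch of the general technique on a toy discrete model, not the authors' implementation. The variables (a binary confounder `z`, a binary region-word match `x`, a binary "grounding is correct" outcome `y`) and all probabilities are assumptions made up for illustration. It contrasts the observational quantity P(y=1 | x), which weights each confounder value by P(z | x), with the adjusted quantity P(y=1 | do(x)) = Σ_z P(y=1 | x, z) P(z), which weights each confounder value by its prior.

```python
# Toy model with hypothetical numbers: z confounds both x and y.
P_Z = {0: 0.7, 1: 0.3}                                      # P(z)
P_X_GIVEN_Z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}    # P(x | z), indexed [z][x]
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.3,                  # P(y=1 | x, z), keyed (x, z)
                 (0, 1): 0.6, (1, 1): 0.8}


def observational(x):
    """P(y=1 | x): each z is weighted by P(z | x), so confounding leaks in."""
    joint_xz = {z: P_Z[z] * P_X_GIVEN_Z[z][x] for z in P_Z}  # P(x, z)
    p_x = sum(joint_xz.values())                              # P(x)
    return sum(P_Y1_GIVEN_XZ[(x, z)] * joint_xz[z] / p_x for z in P_Z)


def backdoor_adjusted(x):
    """P(y=1 | do(x)) = sum_z P(y=1 | x, z) * P(z): each z is weighted by its prior."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


if __name__ == "__main__":
    obs_gap = observational(1) - observational(0)
    causal_gap = backdoor_adjusted(1) - backdoor_adjusted(0)
    print(f"observational gap: {obs_gap:.3f}")
    print(f"adjusted gap:      {causal_gap:.3f}")
```

In this toy setting the observational gap between `x=1` and `x=0` is much larger than the adjusted gap, because `z` makes both the match and the correct outcome more likely. That inflation is the kind of confounded statistical association the abstract describes.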

Similar articles

Local Correspondence Network for Weakly Supervised Temporal Sentence Grounding.

IEEE Trans Image Process. 2021;30:3252-3262. doi: 10.1109/TIP.2021.3058614. Epub 2021 Mar 2.

From Discriminant to Complete: Reinforcement Searching-Agent Learning for Weakly Supervised Object Detection.

IEEE Trans Neural Netw Learn Syst. 2020 Dec;31(12):5549-5560. doi: 10.1109/TNNLS.2020.2969483. Epub 2020 Nov 30.

Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos.

IEEE Trans Pattern Anal Mach Intell. 2022 May;44(5):2725-2741. doi: 10.1109/TPAMI.2020.3038993. Epub 2022 Apr 1.

Diverse Complementary Part Mining for Weakly Supervised Object Localization.

IEEE Trans Image Process. 2022;31:1774-1788. doi: 10.1109/TIP.2022.3145238. Epub 2022 Feb 8.

Progressive Frame-Proposal Mining for Weakly Supervised Video Object Detection.

IEEE Trans Image Process. 2024;33:1560-1573. doi: 10.1109/TIP.2024.3364536. Epub 2024 Feb 27.

Single-Frame Supervision for Spatio-Temporal Video Grounding.

IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5177-5191. doi: 10.1109/TPAMI.2024.3415087.

Cycle-Consistent Weakly Supervised Visual Grounding With Individual and Contextual Representations.

IEEE Trans Image Process. 2023;32:5167-5180. doi: 10.1109/TIP.2023.3311917. Epub 2023 Sep 15.

Weakly Supervised Fine-Grained Categorization With Part-Based Image Representation.

IEEE Trans Image Process. 2016 Apr;25(4):1713-25. doi: 10.1109/TIP.2016.2531289. Epub 2016 Feb 18.

Adversarial Transformers for Weakly Supervised Object Localization.

IEEE Trans Image Process. 2022;31:7130-7143. doi: 10.1109/TIP.2022.3220055. Epub 2022 Nov 16.
