Wang Wei, Gao Junyu, Xu Changsheng
IEEE Trans Pattern Anal Mach Intell. 2023 Mar;45(3):3933-3948. doi: 10.1109/TPAMI.2022.3180025. Epub 2023 Feb 3.
We address the task of weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are available during model learning. The task aims to localize objects described in a sentence to visual regions in the video, a fundamental capability for pattern analysis and machine learning. Despite recent progress, existing methods all suffer from a severe spurious-association problem that harms grounding performance. In this paper, we start from the definition of WSVOG and pinpoint the spurious association from two aspects: (1) the association itself is not object-relevant but extremely ambiguous due to weak supervision; and (2) the association is unavoidably confounded by observational bias when the statistics-based matching strategy of existing methods is adopted. With this in mind, we design a unified causal framework to learn the deconfounded object-relevant association for more accurate and robust video object grounding. Specifically, we learn the object-relevant association by causal intervention from the perspective of the video data generation process. To overcome the lack of fine-grained supervision for such intervention, we propose a novel spatial-temporal adversarial contrastive learning paradigm. To further remove the confounding effect that accompanies the object-relevant association, we pursue the true causality by conducting causal intervention via backdoor adjustment. Finally, the deconfounded object-relevant association is learned and optimized under a unified causal framework in an end-to-end manner. Extensive experiments on both IID and OOD testing sets of three benchmarks demonstrate its accurate and robust grounding performance against state-of-the-art methods.
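For readers unfamiliar with backdoor adjustment, the standard form of the intervention can be sketched as follows; this uses generic variables rather than the paper's specific notation, so the symbols here are illustrative assumptions, not the authors' formulation:

```latex
% Backdoor adjustment (generic form): let X be the observed input
% (e.g., the video-sentence pair), Y the grounding outcome, and Z a
% confounder blocking all backdoor paths from X to Y. Stratifying
% over Z gives the interventional distribution:
P(Y \mid \mathrm{do}(X)) \;=\; \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)
```

The do-operator severs the dependence of X on Z, so the observational bias carried by $P(Z \mid X)$ is replaced by the unconditional marginal $P(Z)$, which is what yields a deconfounded association in frameworks of this kind.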