Suppr 超能文献


Transformer-Empowered Invariant Grounding for Video Question Answering.

Authors

Li Yicong, Wang Xiang, Xiao Junbin, Ji Wei, Chua Tat-Seng

Publication

IEEE Trans Pattern Anal Mach Intell. 2023 Aug 9;PP. doi: 10.1109/TPAMI.2023.3303451.

DOI:10.1109/TPAMI.2023.3303451
PMID:37556333
Abstract

Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is the understanding of the alignments between video scenes and question semantics to yield the answer. In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), tends to over-exploit the spurious correlations between question-irrelevant scenes and answers, instead of inspecting the causal effect of question-critical scenes, which undermines the prediction with unreliable reasoning. In this work, we take a causal look at VideoQA and propose a modal-agnostic learning framework, named Invariant Grounding for VideoQA (IGV), to ground the question-critical scene, whose causal relations with answers are invariant across different interventions on the complement. With IGV, leading VideoQA models are forced to shield the answering from the negative influence of spurious correlations, which significantly improves their reasoning ability. To unleash the potential of this framework, we further provide a Transformer-Empowered Invariant Grounding for VideoQA (TIGV), a substantial instantiation of IGV framework that naturally integrates the idea of invariant grounding into a transformer-style backbone. Experiments on four benchmark datasets validate our design in terms of accuracy, visual explainability, and generalization ability over the leading baselines. Our code is available at https://github.com/yl3800/TIGV.
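As a rough illustration of the idea the abstract describes (not the authors' implementation; every name below is hypothetical), invariant grounding can be read as: learn a soft mask over frames that grounds the question-critical scenes, then require the answer to stay correct both from the grounded scenes alone and after intervening on the complement, e.g. by splicing in frames from an unrelated video:

```python
import torch
import torch.nn.functional as F

def igv_style_loss(scene_logits, video_feats, question_feat,
                   answer, classifier, memory_feats):
    """Sketch of an invariant-grounding training signal (hypothetical).

    scene_logits:  (T,)   per-frame relevance scores for the question
    video_feats:   (T, D) frame features
    question_feat: (D,)   question embedding
    answer:        (1,)   ground-truth answer index
    classifier:    maps a pooled (video + question) feature to answer logits
    memory_feats:  (T, D) frames from an unrelated video, used to intervene
                   on the complement (question-irrelevant) scenes
    """
    mask = torch.sigmoid(scene_logits).unsqueeze(-1)   # soft grounding mask
    grounded = mask * video_feats                      # question-critical scenes
    intervened = grounded + (1 - mask) * memory_feats  # swap the complement

    def predict(feats):
        pooled = feats.mean(dim=0) + question_feat
        return classifier(pooled)

    # The answer should be recoverable from the grounded scenes alone, and
    # remain stable under interventions on the complement: a spurious
    # correlation with the irrelevant scenes would break the second term.
    loss_ground = F.cross_entropy(predict(grounded).unsqueeze(0), answer)
    loss_invariant = F.cross_entropy(predict(intervened).unsqueeze(0), answer)
    return loss_ground + loss_invariant
```

The two terms jointly pressure the mask toward scenes whose causal relation with the answer is invariant, which is the property the framework names.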


Similar Articles

1. Transformer-Empowered Invariant Grounding for Video Question Answering.
   IEEE Trans Pattern Anal Mach Intell. 2023 Aug 9;PP. doi: 10.1109/TPAMI.2023.3303451.
2. A multi-scale self-supervised hypergraph contrastive learning framework for video question answering.
   Neural Netw. 2023 Nov;168:272-286. doi: 10.1016/j.neunet.2023.08.057. Epub 2023 Sep 16.
3. Learning to Answer Visual Questions From Web Videos.
   IEEE Trans Pattern Anal Mach Intell. 2025 May;47(5):3202-3218. doi: 10.1109/TPAMI.2022.3173208. Epub 2025 Apr 8.
4. Dynamic Spatio-Temporal Graph Reasoning for VideoQA With Self-Supervised Event Recognition.
   IEEE Trans Image Process. 2024;33:4145-4158. doi: 10.1109/TIP.2024.3411448. Epub 2024 Jul 9.
5. Contrastive Video Question Answering via Video Graph Transformer.
   IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):13265-13280. doi: 10.1109/TPAMI.2023.3292266. Epub 2023 Oct 3.
6. Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering.
   IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):11624-11641. doi: 10.1109/TPAMI.2023.3284038. Epub 2023 Sep 5.
7. Compositional Attention Networks with Two-Stream Fusion for Video Question Answering.
   IEEE Trans Image Process. 2019 Sep 16. doi: 10.1109/TIP.2019.2940677.
8. Integrating Neural-Symbolic Reasoning With Variational Causal Inference Network for Explanatory Visual Question Answering.
   IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):7893-7908. doi: 10.1109/TPAMI.2024.3398012. Epub 2024 Nov 6.
9. Multi-Granularity Contrastive Cross-Modal Collaborative Generation for End-to-End Long-Term Video Question Answering.
   IEEE Trans Image Process. 2024;33:3115-3129. doi: 10.1109/TIP.2024.3390984. Epub 2024 Apr 30.
10. Event Graph Guided Compositional Spatial-Temporal Reasoning for Video Question Answering.
    IEEE Trans Image Process. 2024;33:1109-1121. doi: 10.1109/TIP.2024.3358726. Epub 2024 Feb 5.