Liu Yang, Li Guanbin, Lin Liang
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):11624-11641. doi: 10.1109/TPAMI.2023.3284038. Epub 2023 Sep 5.
Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes that fail to capture the event temporality, causality, and dynamics spanning the video. In this work, to address the task of event-level visual question answering, we propose a framework for cross-modal causal relational reasoning. In particular, a set of causal intervention operations is introduced to discover the underlying causal structures across the visual and linguistic modalities. Our framework, named Cross-Modal Causal RelatIonal Reasoning (CMCIR), involves three modules: i) a Causality-aware Visual-Linguistic Reasoning (CVLR) module that collaboratively disentangles visual and linguistic spurious correlations via front-door and back-door causal interventions; ii) a Spatial-Temporal Transformer (STT) module that captures fine-grained interactions between visual and linguistic semantics; iii) a Visual-Linguistic Feature Fusion (VLFF) module that adaptively learns global semantic-aware visual-linguistic representations. Extensive experiments on four event-level datasets demonstrate the superiority of CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering.
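For readers unfamiliar with the interventions named in the abstract, back-door and front-door adjustment are the standard formulas from Pearl's causal inference; a minimal sketch in LaTeX notation, where Z stands for an observed confounder set and M for a mediator (both symbols are illustrative here, not taken from the paper):

  Back-door adjustment (deconfounding when the confounder Z is observed):
    P(Y \mid do(X)) = \sum_{z} P(Y \mid X, z)\, P(z)

  Front-door adjustment (via a mediator M when the confounder is unobserved):
    P(Y \mid do(X)) = \sum_{m} P(m \mid X) \sum_{x'} P(Y \mid m, x')\, P(x')

Intuitively, both formulas replace the spurious observational conditional P(Y | X) with an interventional quantity, which is what the CVLR module uses to cut spurious visual-linguistic correlations.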
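The abstract gives no implementation detail for the STT module; the following is a minimal, self-contained PyTorch sketch of the kind of cross-modal attention it describes. The class name, feature dimensions, and single-block residual structure are assumptions for illustration, not the paper's implementation.

  import torch
  import torch.nn as nn

  class CrossModalAttention(nn.Module):
      # Hypothetical sketch: language tokens query video frame features,
      # in the spirit of the Spatial-Temporal Transformer (STT) module.
      def __init__(self, dim=512, heads=8):
          super().__init__()
          self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
          self.norm = nn.LayerNorm(dim)

      def forward(self, lang, video):
          # lang: (B, L, dim) question-token features; video: (B, T, dim) frame features
          fused, _ = self.attn(query=lang, key=video, value=video)
          # Residual connection plus layer norm, a common transformer design choice
          return self.norm(lang + fused)

  # Usage with random features standing in for real encoder outputs
  lang = torch.randn(2, 12, 512)    # two 12-token questions
  video = torch.randn(2, 32, 512)   # two 32-frame clips
  out = CrossModalAttention()(lang, video)   # shape (2, 12, 512)

A full STT would stack such blocks in both directions (language-to-video and video-to-language) and add temporal position encodings; this sketch shows only the core attention step.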