Suppr 超能文献


Transformer-Empowered Invariant Grounding for Video Question Answering.

Authors

Li Yicong, Wang Xiang, Xiao Junbin, Ji Wei, Chua Tat-Seng

Publication

IEEE Trans Pattern Anal Mach Intell. 2023 Aug 9;PP. doi: 10.1109/TPAMI.2023.3303451.

DOI:10.1109/TPAMI.2023.3303451
PMID:37556333
Abstract

Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is the understanding of the alignments between video scenes and question semantics to yield the answer. In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), tends to over-exploit the spurious correlations between question-irrelevant scenes and answers, instead of inspecting the causal effect of question-critical scenes, which undermines the prediction with unreliable reasoning. In this work, we take a causal look at VideoQA and propose a modal-agnostic learning framework, named Invariant Grounding for VideoQA (IGV), to ground the question-critical scene, whose causal relations with answers are invariant across different interventions on the complement. With IGV, leading VideoQA models are forced to shield the answering from the negative influence of spurious correlations, which significantly improves their reasoning ability. To unleash the potential of this framework, we further provide a Transformer-Empowered Invariant Grounding for VideoQA (TIGV), a substantial instantiation of IGV framework that naturally integrates the idea of invariant grounding into a transformer-style backbone. Experiments on four benchmark datasets validate our design in terms of accuracy, visual explainability, and generalization ability over the leading baselines. Our code is available at https://github.com/yl3800/TIGV.
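As a rough illustration of the idea the abstract describes (not the authors' implementation; every name below is hypothetical), invariant grounding can be read as: learn a soft mask over frames that grounds the question-critical scenes, then require the answer to stay correct both from the grounded scenes alone and after intervening on the complement, e.g. by splicing in frames from an unrelated video:

```python
import torch
import torch.nn.functional as F

def igv_style_loss(scene_logits, video_feats, question_feat,
                   answer, classifier, memory_feats):
    """Sketch of an invariant-grounding training signal (hypothetical).

    scene_logits:  (T,)   per-frame relevance scores for the question
    video_feats:   (T, D) frame features
    question_feat: (D,)   question embedding
    answer:        (1,)   ground-truth answer index
    classifier:    maps a pooled (video + question) feature to answer logits
    memory_feats:  (T, D) frames from an unrelated video, used to intervene
                   on the complement (question-irrelevant) scenes
    """
    mask = torch.sigmoid(scene_logits).unsqueeze(-1)   # soft grounding mask
    grounded = mask * video_feats                      # question-critical scenes
    intervened = grounded + (1 - mask) * memory_feats  # swap the complement

    def predict(feats):
        pooled = feats.mean(dim=0) + question_feat
        return classifier(pooled)

    # The answer should be recoverable from the grounded scenes alone, and
    # remain stable under interventions on the complement: a spurious
    # correlation with the irrelevant scenes would break the second term.
    loss_ground = F.cross_entropy(predict(grounded).unsqueeze(0), answer)
    loss_invariant = F.cross_entropy(predict(intervened).unsqueeze(0), answer)
    return loss_ground + loss_invariant
```

The two terms jointly pressure the mask toward scenes whose causal relation with the answer is invariant, which is the property the framework names.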


Similar Articles

1. Transformer-Empowered Invariant Grounding for Video Question Answering.
   IEEE Trans Pattern Anal Mach Intell. 2023 Aug 9;PP. doi: 10.1109/TPAMI.2023.3303451.
2. A multi-scale self-supervised hypergraph contrastive learning framework for video question answering.
   Neural Netw. 2023 Nov;168:272-286. doi: 10.1016/j.neunet.2023.08.057. Epub 2023 Sep 16.
3. Learning to Answer Visual Questions From Web Videos.
   IEEE Trans Pattern Anal Mach Intell. 2025 May;47(5):3202-3218. doi: 10.1109/TPAMI.2022.3173208. Epub 2025 Apr 8.
4. Dynamic Spatio-Temporal Graph Reasoning for VideoQA With Self-Supervised Event Recognition.
   IEEE Trans Image Process. 2024;33:4145-4158. doi: 10.1109/TIP.2024.3411448. Epub 2024 Jul 9.
5. Contrastive Video Question Answering via Video Graph Transformer.
   IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):13265-13280. doi: 10.1109/TPAMI.2023.3292266. Epub 2023 Oct 3.
6. Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering.
   IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):11624-11641. doi: 10.1109/TPAMI.2023.3284038. Epub 2023 Sep 5.
7. Compositional Attention Networks with Two-Stream Fusion for Video Question Answering.
   IEEE Trans Image Process. 2019 Sep 16. doi: 10.1109/TIP.2019.2940677.
8. Integrating Neural-Symbolic Reasoning With Variational Causal Inference Network for Explanatory Visual Question Answering.
   IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):7893-7908. doi: 10.1109/TPAMI.2024.3398012. Epub 2024 Nov 6.
9. Multi-Granularity Contrastive Cross-Modal Collaborative Generation for End-to-End Long-Term Video Question Answering.
   IEEE Trans Image Process. 2024;33:3115-3129. doi: 10.1109/TIP.2024.3390984. Epub 2024 Apr 30.
10. Event Graph Guided Compositional Spatial-Temporal Reasoning for Video Question Answering.
    IEEE Trans Image Process. 2024;33:1109-1121. doi: 10.1109/TIP.2024.3358726. Epub 2024 Feb 5.