Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering.

Authors

Liu Yang, Li Guanbin, Lin Liang

Publication information

IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):11624-11641. doi: 10.1109/TPAMI.2023.3284038. Epub 2023 Sep 5.

DOI: 10.1109/TPAMI.2023.3284038
PMID: 37289602
Abstract

Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes that fail to capture event temporality, causality, and dynamics spanning over the video. In this work, to address the task of event-level visual question answering, we propose a framework for cross-modal causal relational reasoning. In particular, a set of causal intervention operations is introduced to discover the underlying causal structures across visual and linguistic modalities. Our framework, named Cross-Modal Causal RelatIonal Reasoning (CMCIR), involves three modules: i) Causality-aware Visual-Linguistic Reasoning (CVLR) module for collaboratively disentangling the visual and linguistic spurious correlations via front-door and back-door causal interventions; ii) Spatial-Temporal Transformer (STT) module for capturing the fine-grained interactions between visual and linguistic semantics; iii) Visual-Linguistic Feature Fusion (VLFF) module for learning the global semantic-aware visual-linguistic representations adaptively. Extensive experiments on four event-level datasets demonstrate the superiority of our CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering.
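
The abstract refers to back-door causal intervention without spelling it out. As a rough illustration only, the sketch below applies the standard back-door adjustment, P(Y | do(X)) = Σ_z P(Y | X, z) P(z), to a toy discrete model with one binary confounder Z. It is not the CVLR module or any part of CMCIR; every probability table and function name here is a hypothetical choice made for the example.

```python
# Toy back-door adjustment. All numbers are hypothetical and chosen only
# to show how a confounder Z inflates the observed X -> Y association.

p_z = {0: 0.5, 1: 0.5}                      # P(Z=z)
p_x1_given_z = {0: 0.2, 1: 0.8}             # P(X=1 | Z=z)
p_y1_given_xz = {                           # P(Y=1 | X=x, Z=z)
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.3, (1, 1): 0.7,
}


def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary X."""
    return p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]


def observational(x):
    """P(Y=1 | X=x): conditioning only, so Z still confounds the estimate."""
    num = sum(p_y1_given_xz[(x, z)] * p_x_given_z(x, z) * p_z[z] for z in p_z)
    den = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    return num / den


def backdoor(x):
    """P(Y=1 | do(X=x)): average over the marginal P(Z), not P(Z | X)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)


if __name__ == "__main__":
    print("observational difference:", observational(1) - observational(0))
    print("interventional difference:", backdoor(1) - backdoor(0))
```

In this toy model the observational difference is about 0.44 while the interventional difference is about 0.20; the gap is the spurious part of the correlation that flows through Z. The front-door intervention also mentioned in the abstract targets confounders that cannot be observed and is not shown here.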

Similar articles

1
Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering.
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):11624-11641. doi: 10.1109/TPAMI.2023.3284038. Epub 2023 Sep 5.
2
An effective spatial relational reasoning networks for visual question answering.
PLoS One. 2022 Nov 28;17(11):e0277693. doi: 10.1371/journal.pone.0277693. eCollection 2022.
3
Transformer-Empowered Invariant Grounding for Video Question Answering.
IEEE Trans Pattern Anal Mach Intell. 2023 Aug 9;PP. doi: 10.1109/TPAMI.2023.3303451.
4
Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question Answering.
IEEE Trans Image Process. 2022;31:1684-1696. doi: 10.1109/TIP.2022.3142526. Epub 2022 Feb 3.
5
Contrastive Video Question Answering via Video Graph Transformer.
IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):13265-13280. doi: 10.1109/TPAMI.2023.3292266. Epub 2023 Oct 3.
6
Integrating Neural-Symbolic Reasoning With Variational Causal Inference Network for Explanatory Visual Question Answering.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):7893-7908. doi: 10.1109/TPAMI.2024.3398012. Epub 2024 Nov 6.
7
DisAVR: Disentangled Adaptive Visual Reasoning Network for Diagram Question Answering.
IEEE Trans Image Process. 2023;32:4812-4827. doi: 10.1109/TIP.2023.3306910. Epub 2023 Aug 29.
8
Multi-grained visual pivot-guided multi-modal neural machine translation with text-aware cross-modal contrastive disentangling.
Neural Netw. 2024 Oct;178:106403. doi: 10.1016/j.neunet.2024.106403. Epub 2024 May 23.
9
Interpretable Visual Question Answering by Reasoning on Dependency Trees.
IEEE Trans Pattern Anal Mach Intell. 2021 Mar;43(3):887-901. doi: 10.1109/TPAMI.2019.2943456. Epub 2021 Feb 4.
10
Toward Accurate Visual Reasoning With Dual-Path Neural Module Networks.
Front Robot AI. 2020 Aug 21;7:109. doi: 10.3389/frobt.2020.00109. eCollection 2020.

Cited by

1
Novel cross-dimensional coarse-fine-grained complementary network for image-text matching.
PeerJ Comput Sci. 2025 Mar 3;11:e2725. doi: 10.7717/peerj-cs.2725. eCollection 2025.
2
Causal Inference Meets Deep Learning: A Comprehensive Survey.
Research (Wash D C). 2024 Sep 10;7:0467. doi: 10.34133/research.0467. eCollection 2024.
3
Cross-Modal Graph Contrastive Learning with Cellular Images.
Adv Sci (Weinh). 2024 Aug;11(32):e2404845. doi: 10.1002/advs.202404845. Epub 2024 Jun 21.
4
Evaluation of student failure in higher education by an innovative strategy of fuzzy system combined optimization algorithms and AI.
Heliyon. 2024 Apr 3;10(7):e29182. doi: 10.1016/j.heliyon.2024.e29182. eCollection 2024 Apr 15.
5
q-Rung orthopair fuzzy dynamic aggregation operators with time sequence preference for dynamic decision-making.
PeerJ Comput Sci. 2024 Jan 31;10:e1742. doi: 10.7717/peerj-cs.1742. eCollection 2024.
6
Text-Guided Image Editing Based on Post Score for Gaining Attention on Social Media.
Sensors (Basel). 2024 Jan 31;24(3):921. doi: 10.3390/s24030921.