
Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA.

Author Information

Vosoughi Ali, Deng Shijian, Zhang Songyang, Tian Yapeng, Xu Chenliang, Luo Jiebo

Affiliations

Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY 14620.

Department of Computer Science, University of Texas Dallas, Dallas, TX 12345.

Publication Information

IEEE Trans Multimedia. 2024;26:8609-8624. doi: 10.1109/tmm.2024.3380259. Epub 2024 Mar 21.

DOI: 10.1109/tmm.2024.3380259
PMID: 39429951
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11485245/
Abstract

To increase the generalization capability of VQA systems, many recent studies have tried to de-bias spurious language or vision associations that shortcut the question or image to the answer. Despite these efforts, the literature fails to address the confounding effect of vision and language simultaneously: when bias learned from one modality is reduced, bias from the other usually increases. In this paper, we first model a confounding effect that causes language and vision bias simultaneously, then propose a counterfactual inference to remove its influence. A model trained with this strategy can concurrently and efficiently reduce both vision and language bias. To the best of our knowledge, this is the first work to reduce biases resulting from the confounding effects of vision and language in VQA by leveraging causal explain-away relations. Paired with this explain-away strategy, our method improves accuracy on questions with numerical answers, which has remained an open problem for existing methods. The proposed method outperforms state-of-the-art methods on the VQA-CP v2 dataset. Our code is available at https://github.com/ali-vosoughi/PW-VQA.

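The counterfactual idea described in the abstract can be illustrated with a minimal sketch: a fused vision-language prediction is adjusted by subtracting the "shortcut" predictions that each single modality would make on its own, approximating removal of the confounder's effect at inference time. This is a generic simplification for illustration, not the paper's exact PW-VQA formulation; the function names and the weighting `w` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def counterfactual_debias(fused_logits, q_only_logits, v_only_logits, w=0.5):
    """Subtract single-modality 'shortcut' predictions (question-only and
    vision-only branches) from the fused prediction, approximating removal
    of the confounder's direct effect. `w` is an illustrative weight, not
    a value taken from the paper."""
    total = softmax(fused_logits)      # total effect: (V, Q) -> A
    bias_q = softmax(q_only_logits)    # language shortcut: Q -> A
    bias_v = softmax(v_only_logits)    # vision shortcut:   V -> A
    return [t - w * (bq + bv) for t, bq, bv in zip(total, bias_q, bias_v)]

# A language-biased toy example: the question-only branch strongly favors
# answer 0, so debiasing shifts the winning answer from 0 to 1.
fused = [2.0, 1.0, 0.0]
q_only = [3.0, 0.0, 0.0]
v_only = [0.0, 0.0, 0.0]
scores = counterfactual_debias(fused, q_only, v_only)
```

In this toy case the fused model alone would pick answer 0, but after subtracting the strong question-only shortcut the debiased scores favor answer 1, which is the qualitative behavior the counterfactual inference aims for.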

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4760/11485245/b53279002460/nihms-2018301-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4760/11485245/e5055712c97a/nihms-2018301-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4760/11485245/50f886deeb84/nihms-2018301-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4760/11485245/18cf8746e69c/nihms-2018301-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4760/11485245/5a26dd367628/nihms-2018301-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4760/11485245/47a92d15cf5f/nihms-2018301-f0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4760/11485245/d927ffd9a041/nihms-2018301-f0007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4760/11485245/77534dd0fa57/nihms-2018301-f0008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4760/11485245/a7431d5bb807/nihms-2018301-f0009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4760/11485245/f19263ee7e81/nihms-2018301-f0010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4760/11485245/46eed1fb14c6/nihms-2018301-f0011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4760/11485245/82fd06d9b8ed/nihms-2018301-f0012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4760/11485245/6a962848a46e/nihms-2018301-f0013.jpg

Similar Articles

1
Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA.
IEEE Trans Multimedia. 2024;26:8609-8624. doi: 10.1109/tmm.2024.3380259. Epub 2024 Mar 21.
2
Counterfactual Dual-Bias VQA: A Multimodality Debias Learning for Robust Visual Question Answering.
IEEE Trans Neural Netw Learn Syst. 2025 Sep;36(9):16366-16378. doi: 10.1109/TNNLS.2025.3562085.
3
Cross-Modal self-supervised vision language pre-training with multiple objectives for medical visual question answering.
J Biomed Inform. 2024 Dec;160:104748. doi: 10.1016/j.jbi.2024.104748. Epub 2024 Nov 12.
4
Collaborative Modality Fusion for Mitigating Language Bias in Visual Question Answering.
J Imaging. 2024 Feb 23;10(3):56. doi: 10.3390/jimaging10030056.
5
Rich Visual Knowledge-Based Augmentation Network for Visual Question Answering.
IEEE Trans Neural Netw Learn Syst. 2021 Oct;32(10):4362-4373. doi: 10.1109/TNNLS.2020.3017530. Epub 2021 Oct 5.
6
Advancing surgical VQA with scene graph knowledge.
Int J Comput Assist Radiol Surg. 2024 Jul;19(7):1409-1417. doi: 10.1007/s11548-024-03141-y. Epub 2024 May 23.
7
Reducing Vision-Answer Biases for Multiple-Choice VQA.
IEEE Trans Image Process. 2023;32:4621-4634. doi: 10.1109/TIP.2023.3302162. Epub 2023 Aug 16.
8
Robust visual question answering via polarity enhancement and contrast.
Neural Netw. 2024 Nov;179:106560. doi: 10.1016/j.neunet.2024.106560. Epub 2024 Jul 20.
9
Interpretable medical image Visual Question Answering via multi-modal relationship graph learning.
Med Image Anal. 2024 Oct;97:103279. doi: 10.1016/j.media.2024.103279. Epub 2024 Jul 20.
10
Counterfactual Causal-Effect Intervention for Interpretable Medical Visual Question Answering.
IEEE Trans Med Imaging. 2024 Dec;43(12):4430-4441. doi: 10.1109/TMI.2024.3425533. Epub 2024 Dec 2.

Cited By

1
TBKIN: Threshold-based explicit selection for enhanced cross-modal semantic alignments.
PLoS One. 2025 Jun 10;20(6):e0325543. doi: 10.1371/journal.pone.0325543. eCollection 2025.

References

1
Learning Resolution-Adaptive Representations for Cross-Resolution Person Re-Identification.
IEEE Trans Image Process. 2023;32:4800-4811. doi: 10.1109/TIP.2023.3305817. Epub 2023 Aug 29.
2
Reducing Vision-Answer Biases for Multiple-Choice VQA.
IEEE Trans Image Process. 2023;32:4621-4634. doi: 10.1109/TIP.2023.3302162. Epub 2023 Aug 16.
3
Deconfounded Image Captioning: A Causal Retrospect.
IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12996-13010. doi: 10.1109/TPAMI.2021.3121705. Epub 2023 Oct 3.
4
Effects of Language on Visual Perception.
Trends Cogn Sci. 2020 Nov;24(11):930-944. doi: 10.1016/j.tics.2020.08.005. Epub 2020 Oct 1.