Lu Qiwen, Chen Shengbo, Zhu Xiaoke
School of Computer and Information Engineering, Henan University, Kaifeng 475001, China.
J Imaging. 2024 Feb 23;10(3):56. doi: 10.3390/jimaging10030056.
Language bias is a well-known problem in visual question answering (VQA): models tend to rely on spurious correlations between questions and answers when predicting, which prevents them from generalizing effectively and degrades performance. To address this bias, we propose a novel modality fusion collaborative de-biasing algorithm (CoD). In our approach, bias is viewed as the model neglecting information from a particular modality during prediction. We employ collaborative training to facilitate mutual modeling between modalities, achieving efficient feature fusion and enabling the model to fully exploit multimodal knowledge for prediction. Experiments on the VQA-CP v2, VQA v2, and VQA-VS datasets, under different validation strategies, demonstrate the effectiveness of our approach. Notably, with a basic baseline model, our method reaches an accuracy of 60.14% on VQA-CP v2.
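The abstract does not spell out the algorithmic details of CoD, but the core idea it states (treating bias as the model neglecting one modality, and training branches collaboratively so that the fused prediction draws on both) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption made for illustration: the module names, feature dimensions, the multiplicative fusion, and the multi-task objective (fused branch plus auxiliary question-only and image-only branches) are hypothetical and are not the paper's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCollaborativeVQA(nn.Module):
    """Illustrative sketch: a fused prediction head plus question-only and
    image-only heads, so that each modality is explicitly modeled and a
    prediction cannot silently ignore one of them. Not the CoD architecture."""
    def __init__(self, q_dim=300, v_dim=2048, hid=512, n_answers=3000):
        super().__init__()
        self.q_enc = nn.Sequential(nn.Linear(q_dim, hid), nn.ReLU())
        self.v_enc = nn.Sequential(nn.Linear(v_dim, hid), nn.ReLU())
        self.fused_head = nn.Linear(hid, n_answers)  # uses both modalities
        self.q_head = nn.Linear(hid, n_answers)      # question-only branch
        self.v_head = nn.Linear(hid, n_answers)      # image-only branch

    def forward(self, q_feat, v_feat):
        q = self.q_enc(q_feat)
        v = self.v_enc(v_feat)
        fused = q * v  # simple element-wise (multiplicative) fusion
        return self.fused_head(fused), self.q_head(q), self.v_head(v)

def collaborative_loss(logits_fused, logits_q, logits_v, target, alpha=0.5):
    """Hypothetical joint objective: the fused branch is supervised on the
    answer labels, and the unimodal branches are trained as auxiliary
    predictors so both modalities are modeled during training."""
    loss_fused = F.cross_entropy(logits_fused, target)
    loss_q = F.cross_entropy(logits_q, target)
    loss_v = F.cross_entropy(logits_v, target)
    return loss_fused + alpha * (loss_q + loss_v)

# Minimal usage example with random features and labels.
model = ToyCollaborativeVQA()
q_feat = torch.randn(8, 300)   # e.g. averaged question embeddings
v_feat = torch.randn(8, 2048)  # e.g. pooled image features
target = torch.randint(0, 3000, (8,))
loss = collaborative_loss(*model(q_feat, v_feat), target)
loss.backward()
```

At test time, only the fused head would be used; the auxiliary branches serve during training to make each modality's contribution explicit. Again, this is a generic multi-branch sketch of the stated idea, not the paper's reported method.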