
Robust visual question answering via polarity enhancement and contrast.

Affiliations

Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China; Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin 541004, China.

Publication information

Neural Netw. 2024 Nov;179:106560. doi: 10.1016/j.neunet.2024.106560. Epub 2024 Jul 20.

Abstract

The Visual Question Answering (VQA) task is an important research direction in artificial intelligence: it requires a model that can simultaneously understand visual images and natural-language questions and answer questions related to the images. Recent studies have shown that many VQA models rely on statistically regular correlations between questions and answers, which in turn weakens the correlation between visual content and textual information. In this work, we propose an unbiased VQA method that counters language priors by strengthening the contrast between the correct answer and the positive and negative predictions. We design a new model consisting of two modules with different roles. The image and its corresponding question are fed into the Answer Visual Attention Modules to generate the positive prediction, while a Dual Channels Joint Module generates the negative prediction, which carries strong language-prior knowledge. Finally, the positive and negative predictions, together with the correct answer, are passed to our newly designed loss function for training. Our method achieves high accuracy (61.24%) on the VQA-CP v2 dataset. Moreover, most existing debiasing methods improve performance on VQA-CP v2 at the cost of accuracy on VQA v2, whereas our method improves performance on both datasets.
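The abstract does not give the form of the training objective, only that it strengthens the contrast between the correct answer and the positive and negative branch predictions. A minimal sketch of one plausible instantiation, assuming a cross-entropy term on the positive branch plus a margin term that pushes the negative (language-prior) branch away from the ground-truth answer, might look like the following. The function names and the margin formulation are illustrative assumptions, not the authors' actual loss.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def contrast_loss(pos_logits, neg_logits, answer, margin=0.5):
    """Hypothetical debiasing objective: cross-entropy fits the
    positive branch to the correct answer, while a hinge term
    penalises cases where the negative (language-prior) branch
    scores the correct answer nearly as high as the positive one."""
    p_pos = softmax(pos_logits)
    p_neg = softmax(neg_logits)
    ce = -np.log(p_pos[..., answer] + 1e-12)  # fit the correct answer
    # Push p_neg(answer) below p_pos(answer) by at least `margin`.
    contrast = np.maximum(0.0, margin + p_neg[..., answer] - p_pos[..., answer])
    return ce + contrast
```

Under this sketch, the loss is low when the positive branch is confident in the correct answer and the negative, prior-driven branch is not, which is the contrast the abstract describes.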

