Multi-Modal Explicit Sparse Attention Networks for Visual Question Answering.

Affiliation

College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.

Publication Information

Sensors (Basel). 2020 Nov 26;20(23):6758. doi: 10.3390/s20236758.

Abstract

Visual question answering (VQA) is a multi-modal task involving natural language processing (NLP) and computer vision (CV). It requires models to understand both visual and textual information simultaneously in order to predict the correct answer to a textual question about an input image, and it has been widely applied in intelligent transport systems, smart cities, and other fields. Today, advanced VQA approaches model dense interactions between image regions and question words by designing co-attention mechanisms to achieve better accuracy. However, modeling the interaction between every image region and every question word forces the model to compute irrelevant information, distracting its attention. In this paper, to solve this problem, we propose a novel model called Multi-modal Explicit Sparse Attention Networks (MESAN), which concentrates the model's attention by explicitly selecting the parts of the input features that are most relevant to answering the input question. We argue that this top-k selection-based method can reduce the interference caused by irrelevant information and ultimately help the model achieve better performance. Experimental results on the benchmark dataset VQA v2 demonstrate the effectiveness of our model. Our best single model delivers 70.71% and 71.08% overall accuracy on the test-dev and test-std sets, respectively. In addition, attention visualizations show that our model obtains better attended features than other advanced models. Our work proves that models with sparse attention mechanisms can also achieve competitive results on VQA datasets. We hope that it can promote the development of VQA models and the application of VQA-related artificial intelligence (AI) technology in various fields.
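The core idea described above, explicit sparse attention via top-k selection, is straightforward to sketch. Below is a minimal, illustrative PyTorch implementation that masks all but the k largest scores per query before the softmax, assuming standard scaled dot-product attention; the function name, tensor shapes, and choice of k are ours for illustration and are not taken from the paper's released code.

```python
# Minimal sketch of top-k explicit sparse attention (illustrative, not the
# authors' implementation). Standard scaled dot-product scores are computed,
# then all but the k largest scores per query row are set to -inf so they
# vanish after softmax; each query thus attends to only its k best keys.
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, top_k):
    """q, k, v: (batch, heads, seq_len, head_dim); top_k: keys kept per query."""
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5  # (b, h, lq, lk)
    top_k = min(top_k, scores.size(-1))
    # topk returns values sorted descending, so [..., -1:] is the k-th
    # largest score in each row; everything strictly below it is masked.
    kth = scores.topk(top_k, dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < kth, float('-inf'))
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

# Example: 4 question words attending over 36 image regions, keeping top-8.
q = torch.randn(1, 8, 4, 64)
k = torch.randn(1, 8, 36, 64)
v = torch.randn(1, 8, 36, 64)
out = sparse_attention(q, k, v, top_k=8)
print(out.shape)  # torch.Size([1, 8, 4, 64])
```

In a co-attention stack such as MESAN's, a unit like this would stand in for the dense attention inside each layer; note that ties at the k-th score may keep a few extra keys under this simple masking scheme.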

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f8ff/7730290/8a3c4d97988e/sensors-20-06758-g001.jpg
