
Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering.

Author Information

Cao Jianjian, Qin Xiameng, Zhao Sanyuan, Shen Jianbing

Publication Information

IEEE Trans Neural Netw Learn Syst. 2025 Mar;36(3):4160-4171. doi: 10.1109/TNNLS.2021.3135655. Epub 2025 Feb 28.

Abstract

Answering semantically complicated questions about an image is challenging in the visual question answering (VQA) task. Although the image can be well represented by deep learning, the question is often simply embedded, which fails to capture its meaning. Moreover, because visual and textual features come from different modalities, there is a gap between them, making it difficult to align and exploit cross-modality information. In this article, we focus on these two problems and propose a graph matching attention (GMA) network. First, it builds a graph not only for the image but also for the question, in terms of both syntactic and embedding information. Next, we explore the intramodality relationships with a dual-stage graph encoder and then present a bilateral cross-modality GMA to infer the relationships between the image and the question. The updated cross-modality features are then fed into the answer prediction module for final answer prediction. Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset. Ablation studies verify the effectiveness of each module in the GMA network.
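
The bilateral cross-modality matching step described in the abstract can be illustrated with a short sketch. The PyTorch code below is a minimal illustration under our own assumptions, not the authors' released implementation: the node feature shapes, the dot-product affinity between image-graph and question-graph nodes, and the concatenation-based fusion are all illustrative choices; only the overall idea (attention in both directions between the two graphs, followed by feature update) comes from the abstract.

```python
# Minimal sketch of a bilateral cross-modality matching attention.
# All names, shapes, and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilateralGraphMatchingAttention(nn.Module):
    """Cross-attends image-graph nodes and question-graph nodes in both
    directions, then fuses each node with its cross-modal context."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)      # projects image-node features
        self.proj_q = nn.Linear(dim, dim)      # projects question-node features
        self.fuse_v = nn.Linear(2 * dim, dim)  # fuses image node + context
        self.fuse_q = nn.Linear(2 * dim, dim)  # fuses question node + context

    def forward(self, v: torch.Tensor, q: torch.Tensor):
        # v: (B, Nv, D) image-graph node features
        # q: (B, Nq, D) question-graph node features
        # Affinity between every image node and every question node.
        s = torch.bmm(self.proj_v(v), self.proj_q(q).transpose(1, 2))  # (B, Nv, Nq)

        # Image -> question direction: each image node attends over question nodes.
        attn_vq = F.softmax(s, dim=-1)                   # (B, Nv, Nq)
        ctx_v = torch.bmm(attn_vq, q)                    # (B, Nv, D)

        # Question -> image direction: each question node attends over image nodes.
        attn_qv = F.softmax(s.transpose(1, 2), dim=-1)   # (B, Nq, Nv)
        ctx_q = torch.bmm(attn_qv, v)                    # (B, Nq, D)

        # Update each node with its cross-modal context (concat + linear).
        v_out = self.fuse_v(torch.cat([v, ctx_v], dim=-1))
        q_out = self.fuse_q(torch.cat([q, ctx_q], dim=-1))
        return v_out, q_out

if __name__ == "__main__":
    gma = BilateralGraphMatchingAttention(dim=512)
    v = torch.randn(2, 36, 512)   # e.g., 36 detected regions per image
    q = torch.randn(2, 14, 512)   # e.g., 14 question tokens
    v_new, q_new = gma(v, q)
    print(v_new.shape, q_new.shape)  # (2, 36, 512), (2, 14, 512)
```

The updated node features from both directions would then feed the answer prediction module, as described in the abstract; the paper's full pipeline additionally includes the graph construction and the dual-stage graph encoder, which this sketch omits.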

