

Adversarial Learning With Multi-Modal Attention for Visual Question Answering

Authors

Liu Yun, Zhang Xiaoming, Huang Feiran, Cheng Lei, Li Zhoujun

Publication

IEEE Trans Neural Netw Learn Syst. 2021 Sep;32(9):3894-3908. doi: 10.1109/TNNLS.2020.3016083. Epub 2021 Aug 31.

DOI: 10.1109/TNNLS.2020.3016083
PMID: 32833656
Abstract

Visual question answering (VQA) has been proposed as a challenging task and has attracted extensive research attention. It aims to learn a joint representation of the question-image pair for answer inference. Most existing methods focus on exploring the multi-modal correlation between the question and the image to learn the joint representation. However, these methods do not fully capture the answer-related information, so the learned representation cannot effectively reflect the answer to the question. To tackle this problem, we propose a novel model, adversarial learning with multi-modal attention (ALMA), for VQA. An adversarial learning-based framework is proposed to learn a joint representation that effectively reflects the answer-related information. Specifically, multi-modal attention with a Siamese similarity learning method is designed to build two embedding generators, i.e., a question-image embedding and a question-answer embedding. Adversarial learning is then conducted as an interplay between the two embedding generators and an embedding discriminator. The generators aim to produce two modality-invariant representations for the question-image and question-answer pairs, whereas the embedding discriminator aims to distinguish the two representations. Both the multi-modal attention module and the adversarial networks are integrated into an end-to-end unified framework to infer the answer. Experiments on three benchmark data sets confirm the favorable performance of ALMA compared with state-of-the-art approaches.

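The adversarial interplay the abstract describes (two embedding generators trained against an embedding discriminator until the two representations become indistinguishable) can be sketched in miniature. This is a toy illustration only: the linear generators, logistic discriminator, dimensions, and variable names below are all assumptions for exposition, not the paper's attention-based model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_emb = 64, 8, 4

# Toy stand-ins for the two input streams. The question-answer features are
# mean-shifted so the two embedding distributions start out distinguishable.
x_qi = rng.normal(size=(n, d_in))          # "question-image" features
x_qa = rng.normal(size=(n, d_in)) + 0.5    # "question-answer" features

# Linear "generators" (stand-ins for the attention-based encoders) mapping each
# stream into a shared embedding space, plus a logistic "discriminator".
W_qi = rng.normal(scale=0.1, size=(d_in, d_emb))
W_qa = rng.normal(scale=0.1, size=(d_in, d_emb))
w_d = np.zeros(d_emb)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.05
for _ in range(300):
    # Discriminator step: label question-image embeddings 1, question-answer 0,
    # and take a gradient step on the binary cross-entropy.
    e_qi, e_qa = x_qi @ W_qi, x_qa @ W_qa
    p_qi, p_qa = sigmoid(e_qi @ w_d), sigmoid(e_qa @ w_d)
    grad_d = (e_qi.T @ (p_qi - 1.0) + e_qa.T @ p_qa) / n
    w_d -= lr * grad_d
    # Generator step: each generator updates with flipped labels to fool the
    # discriminator, pushing the embeddings toward modality invariance.
    p_qi = sigmoid(x_qi @ W_qi @ w_d)
    p_qa = sigmoid(x_qa @ W_qa @ w_d)
    W_qi -= lr * (x_qi.T @ (p_qi[:, None] * w_d[None, :])) / n
    W_qa -= lr * (x_qa.T @ ((p_qa - 1.0)[:, None] * w_d[None, :])) / n

# After training, the discriminator's mean outputs on the two embedding sets
# should be close, i.e., it can no longer separate them well.
e_qi, e_qa = x_qi @ W_qi, x_qa @ W_qa
gap = abs(sigmoid(e_qi @ w_d).mean() - sigmoid(e_qa @ w_d).mean())
print(f"discriminator output gap after training: {gap:.3f}")
```

In the full model this min-max game is played between the attention-based question-image and question-answer encoders and a learned discriminator network, all trained end-to-end together with the answer-inference loss.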

Similar Articles

1. Adversarial Learning With Multi-Modal Attention for Visual Question Answering.
IEEE Trans Neural Netw Learn Syst. 2021 Sep;32(9):3894-3908. doi: 10.1109/TNNLS.2020.3016083. Epub 2021 Aug 31.
2. ALSA: Adversarial Learning of Supervised Attentions for Visual Question Answering.
IEEE Trans Cybern. 2022 Jun;52(6):4520-4533. doi: 10.1109/TCYB.2020.3029423. Epub 2022 Jun 16.
3. Parallel multi-head attention and term-weighted question embedding for medical visual question answering.
Multimed Tools Appl. 2023 Mar 11:1-22. doi: 10.1007/s11042-023-14981-2.
4. Open-Ended Video Question Answering via Multi-Modal Conditional Adversarial Networks.
IEEE Trans Image Process. 2020 Jan 29. doi: 10.1109/TIP.2020.2963950.
5. Multi-modal adaptive gated mechanism for visual question answering.
PLoS One. 2023 Jun 28;18(6):e0287557. doi: 10.1371/journal.pone.0287557. eCollection 2023.
6. Multi-Modal Explicit Sparse Attention Networks for Visual Question Answering.
Sensors (Basel). 2020 Nov 26;20(23):6758. doi: 10.3390/s20236758.
7. Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding.
IEEE Trans Neural Netw Learn Syst. 2022 Jul;33(7):2758-2767. doi: 10.1109/TNNLS.2020.3045034. Epub 2022 Jul 6.
8. Multi-Turn Video Question Answering via Hierarchical Attention Context Reinforced Networks.
IEEE Trans Image Process. 2019 Aug;28(8):3860-3872. doi: 10.1109/TIP.2019.2902106. Epub 2019 Feb 27.
9. Interpretable medical image Visual Question Answering via multi-modal relationship graph learning.
Med Image Anal. 2024 Oct;97:103279. doi: 10.1016/j.media.2024.103279. Epub 2024 Jul 20.
10. Bridging the Cross-Modality Semantic Gap in Visual Question Answering.
IEEE Trans Neural Netw Learn Syst. 2025 Mar;36(3):4519-4531. doi: 10.1109/TNNLS.2024.3370925. Epub 2025 Feb 28.

Cited By

1. The multi-modal fusion in visual question answering: a review of attention mechanisms.
PeerJ Comput Sci. 2023 May 30;9:e1400. doi: 10.7717/peerj-cs.1400. eCollection 2023.
2. Adversarial Learning with Bidirectional Attention for Visual Question Answering.
Sensors (Basel). 2021 Oct 28;21(21):7164. doi: 10.3390/s21217164.