Liu Yun, Zhang Xiaoming, Huang Feiran, Cheng Lei, Li Zhoujun
IEEE Trans Neural Netw Learn Syst. 2021 Sep;32(9):3894-3908. doi: 10.1109/TNNLS.2020.3016083. Epub 2021 Aug 31.
Visual question answering (VQA) has been proposed as a challenging task and has attracted extensive research attention. It aims to learn a joint representation of the question-image pair for answer inference. Most existing methods focus on exploring the multi-modal correlation between the question and the image to learn this joint representation. However, these methods do not fully capture answer-related information, so the learned representation fails to reflect the answer to the question. To tackle this problem, we propose a novel model, i.e., adversarial learning with multi-modal attention (ALMA), for VQA. An adversarial learning-based framework is proposed to learn a joint representation that effectively reflects answer-related information. Specifically, multi-modal attention with a Siamese similarity learning method is designed to build two embedding generators, i.e., a question-image embedding and a question-answer embedding. Then, adversarial learning is conducted as an interplay between the two embedding generators and an embedding discriminator. The generators aim to produce two modality-invariant representations for the question-image and question-answer pairs, whereas the embedding discriminator aims to distinguish the two representations. Both the multi-modal attention module and the adversarial networks are integrated into an end-to-end unified framework to infer the answer. Experiments performed on three benchmark data sets confirm the favorable performance of ALMA compared with state-of-the-art approaches.
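The adversarial interplay described in the abstract can be sketched as a min-max objective: a discriminator learns to tell question-image embeddings from question-answer embeddings, while the two generators are trained with the opposite loss so the representations become modality-invariant. The sketch below is illustrative only, under simplifying assumptions: the dimensions, the linear generators standing in for the attention-based encoders, and the logistic discriminator are all hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not specified in the abstract).
d_in, d_emb = 16, 8

# Two embedding generators: question-image (QI) and question-answer (QA).
# Each is a single linear map + tanh, standing in for the paper's
# multi-modal-attention / Siamese encoders.
W_qi = rng.normal(size=(d_in, d_emb)) * 0.1
W_qa = rng.normal(size=(d_in, d_emb)) * 0.1

def generate(x, W):
    return np.tanh(x @ W)

# Embedding discriminator: a logistic classifier that tries to tell
# QI embeddings (label 1) from QA embeddings (label 0).
w_d = rng.normal(size=d_emb) * 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_loss(e_qi, e_qa):
    # Binary cross-entropy: push p(QI) toward 1 and p(QA) toward 0.
    p_qi = sigmoid(e_qi @ w_d)
    p_qa = sigmoid(e_qa @ w_d)
    return -np.mean(np.log(p_qi + 1e-8)) - np.mean(np.log(1.0 - p_qa + 1e-8))

def generator_loss(e_qi, e_qa):
    # Generators are trained adversarially: maximize the discriminator's
    # loss so the two representations become indistinguishable.
    return -discriminator_loss(e_qi, e_qa)

# Stand-in features for a batch of question-image / question-answer pairs.
x_qi = rng.normal(size=(4, d_in))
x_qa = rng.normal(size=(4, d_in))

e_qi, e_qa = generate(x_qi, W_qi), generate(x_qa, W_qa)
print("discriminator loss:", discriminator_loss(e_qi, e_qa))
print("generator loss:    ", generator_loss(e_qi, e_qa))
```

In a full implementation the two losses would be minimized in alternation (discriminator step, then generator step), and the resulting question-image embedding would feed the answer classifier.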