IEEE Trans Cybern. 2022 Jun;52(6):4520-4533. doi: 10.1109/TCYB.2020.3029423. Epub 2022 Jun 16.
Visual question answering (VQA) has gained increasing attention in both natural language processing and computer vision. The attention mechanism plays a crucial role in relating the question to meaningful image regions for answer inference. However, most existing VQA methods: 1) learn the attention distribution from either free-form regions or detection boxes in the image, which makes it difficult to answer questions about the foreground object and the background form, respectively, and 2) neglect the prior knowledge of human attention and learn the attention distribution with an unguided strategy. To fully exploit the advantages of attention, the learned attention distribution should, like human attention, focus on the question-related image regions for questions about both the foreground object and the background form. To achieve this, this article proposes a novel VQA model called adversarial learning of supervised attentions (ALSA). Specifically, two supervised attention modules, 1) free-form-based and 2) detection-based, are designed to exploit the prior knowledge for attention distribution learning. To effectively learn the correlations between the question and the image from different views, that is, free-form regions and detection boxes, an adversarial learning mechanism is implemented as an interplay between the two supervised attention modules. The adversarial learning reinforces the two attention modules mutually to make the learned multiview features more effective for answer inference. Experiments on three commonly used VQA datasets confirm the favorable performance of ALSA.
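The idea of supervising an attention distribution with a human-attention prior can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product scoring, the KL-divergence loss, the feature sizes, and all function names below are assumptions chosen for illustration, the human attention maps are random placeholders, and the adversarial interplay between the two modules is omitted.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(question_vec, region_feats):
    # Question-guided attention over image regions.
    # Dot-product scoring is an assumption, not the paper's formulation.
    return softmax(region_feats @ question_vec)

def supervision_loss(pred_attn, human_attn, eps=1e-8):
    # KL(human || predicted): one possible way to guide attention
    # with a human-attention prior (hypothetical choice of loss).
    h = human_attn / human_attn.sum()
    return float(np.sum(h * (np.log(h + eps) - np.log(pred_attn + eps))))

rng = np.random.default_rng(0)
question_vec = rng.normal(size=16)

# Two views of the same image (sizes are illustrative assumptions).
grid_feats = rng.normal(size=(49, 16))  # free-form view: 7x7 grid cells
box_feats = rng.normal(size=(36, 16))   # detection view: 36 boxes

grid_attn = attention(question_vec, grid_feats)
box_attn = attention(question_vec, box_feats)

# Placeholder human attention maps, one weight per region.
human_grid = rng.random(49)
human_box = rng.random(36)

# Each module is guided by its own supervision term.
loss = supervision_loss(grid_attn, human_grid) + supervision_loss(box_attn, human_box)

# Attention-weighted features from each view, usable for answer inference.
attended_grid = grid_attn @ grid_feats
attended_box = box_attn @ box_feats
```

In this sketch each view yields its own attended feature vector; how the two are combined and trained adversarially is specific to the paper and is not reproduced here.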