Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 02841, Republic of Korea.
Department of Artificial Intelligence, Kyungpook National University, Daehak-ro, Buk-gu, Daegu 41566, Republic of Korea.
Neural Netw. 2021 Jul;139:158-167. doi: 10.1016/j.neunet.2021.02.001. Epub 2021 Feb 24.
Visual question answering requires a deep understanding of both images and natural language. However, most methods mainly focus on visual concepts, such as the relationships between various objects. The limited use of object categories combined with their relationships, or of simple question embeddings, is insufficient for representing complex scenes and explaining decisions. To address this limitation, we propose using text expressions generated for images, because such expressions have few structural constraints and can provide richer descriptions of images. The generated expressions can be combined with visual features and question embeddings to obtain the question-relevant answer. We also propose a joint-embedding multi-head attention network to model the three information modalities with co-attention. We quantitatively and qualitatively evaluated the proposed method on the VQA v2 dataset and compared it with state-of-the-art methods in terms of answer prediction. The quality of the generated expressions was also evaluated on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Experimental results demonstrate the effectiveness of the proposed method and show that it outperforms all competing methods both quantitatively and qualitatively.
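To make the tri-modal co-attention idea concrete, the following PyTorch sketch shows one plausible way a question embedding could attend over both visual features and embedded generated expressions before fusion. This is a minimal illustration under assumptions: the class name `TriModalCoAttention`, the dimensions, and the concatenate-then-project fusion are hypothetical and do not reproduce the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TriModalCoAttention(nn.Module):
    """Hypothetical sketch of joint-embedding multi-head co-attention over
    three modalities: visual features, question embeddings, and generated
    text expressions. Not the paper's exact architecture."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # The question attends over image regions and over expression tokens.
        self.q_to_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.q_to_e = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Assumed fusion: concatenate the two contexts and project back.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, q, v, e):
        # q: (B, Lq, D) question tokens; v: (B, Nv, D) region features;
        # e: (B, Le, D) tokens of the generated expressions.
        qv, _ = self.q_to_v(q, v, v)  # question-guided visual context
        qe, _ = self.q_to_e(q, e, e)  # question-guided expression context
        joint = self.fuse(torch.cat([qv, qe], dim=-1))
        # Pool over the question length to get one vector per example,
        # which a downstream classifier could map to an answer.
        return joint.mean(dim=1)      # (B, D)

if __name__ == "__main__":
    B, D = 2, 512
    model = TriModalCoAttention()
    q = torch.randn(B, 14, D)   # question embedding
    v = torch.randn(B, 36, D)   # e.g., 36 detected-region features
    e = torch.randn(B, 20, D)   # embedded generated expression
    print(model(q, v, e).shape)  # torch.Size([2, 512])
```

In this sketch, running both attention branches from the same question query is what lets the expression text complement the visual features; the actual fusion and pooling choices in the paper may differ.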