IEEE Trans Pattern Anal Mach Intell. 2018 Jun;40(6):1367-1381. doi: 10.1109/TPAMI.2017.2708709. Epub 2017 May 26.
Much of the recent progress in Vision-to-Language problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we first propose a method of incorporating high-level concepts into the successful CNN-RNN approach, and show that it achieves a significant improvement on the state of the art in both image captioning and visual question answering. We further show that the same mechanism can be used to incorporate external knowledge, which is critically important for answering high-level visual questions. Specifically, we design a visual question answering model that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. In particular, it allows questions to be asked whose answers cannot be determined from the image alone. Our final model achieves the best reported results for both image captioning and visual question answering on several of the major benchmark datasets.
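To make the core mechanism concrete, the following is a minimal PyTorch-style sketch of the idea the abstract describes: a CNN backbone predicts a vector of high-level attribute probabilities, and that attribute vector, rather than the raw image features, conditions an LSTM decoder by being fed in as the first step of the sequence. All names (AttributeCaptioner, attribute_head, attr_embed) and dimensions here are hypothetical illustrations, not the paper's actual architecture; the same conditioning slot could also carry an encoding of text retrieved from an external knowledge base, per the abstract.

import torch
import torch.nn as nn

class AttributeCaptioner(nn.Module):
    """Sketch: condition an RNN caption decoder on predicted
    high-level attributes instead of raw CNN features."""
    def __init__(self, num_attributes, vocab_size,
                 embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        # Stand-in for a multi-label attribute head on top of pooled
        # CNN features (hypothetical; details differ in the paper).
        self.attribute_head = nn.Sequential(
            nn.Linear(feat_dim, num_attributes), nn.Sigmoid())
        # Map the attribute vector into the decoder's input space so it
        # can act as the "zeroth word" fed to the LSTM.
        self.attr_embed = nn.Linear(num_attributes, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cnn_features, captions):
        # cnn_features: (B, feat_dim) pooled backbone features
        # captions:     (B, T) ground-truth caption token ids
        attrs = self.attribute_head(cnn_features)       # (B, A)
        v_attr = self.attr_embed(attrs).unsqueeze(1)    # (B, 1, E)
        words = self.word_embed(captions)               # (B, T, E)
        # Prepend the attribute embedding as step 0 of the sequence.
        inputs = torch.cat([v_attr, words], dim=1)      # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                         # (B, T+1, V)

model = AttributeCaptioner(num_attributes=256, vocab_size=10000)
feats = torch.randn(2, 2048)
caps = torch.randint(0, 10000, (2, 12))
print(model(feats, caps).shape)  # torch.Size([2, 13, 10000])

The design point the sketch illustrates is that the attribute vector is an explicit, interpretable bottleneck between vision and language, which is what lets the same input slot accept externally retrieved knowledge for the question-answering case.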