IEEE Trans Cybern. 2021 Feb;51(2):913-926. doi: 10.1109/TCYB.2019.2914351. Epub 2021 Jan 15.
Vision-to-language tasks aim to integrate computer vision and natural language processing, and have attracted the attention of many researchers. Typical approaches encode an image into feature representations and decode them into natural language sentences, but they neglect high-level semantic concepts and the subtle relationships between image regions and natural language elements. To make full use of this information, this paper exploits text-guided attention and semantic-guided attention (SA) to find more correlated spatial information and to reduce the semantic gap between vision and language. Our method consists of two-level attention networks. One is the text-guided attention network, which selects text-related regions. The other is the SA network, which highlights concept-related regions and region-related concepts. Finally, all of this information is incorporated to generate captions or answers. Image captioning and visual question answering experiments have been carried out, and the results demonstrate the excellent performance of the proposed approach.
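The abstract describes a two-level design: a text-guided attention module that weights image regions by a text query, and a semantic-guided attention module that links regions and detected concept embeddings in both directions before fusion. Below is a minimal PyTorch sketch of that structure, not the authors' released code; all module names, dimensions (e.g. `region_dim=2048` for CNN region features, `concept_dim=300` for concept embeddings), and the fusion step are illustrative assumptions.

```python
# Minimal sketch of two-level (text-guided + semantic-guided) attention.
# All dimensions and the fusion scheme are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedAttention(nn.Module):
    """Soft attention over a set of feature vectors, guided by a query vector."""
    def __init__(self, feat_dim, query_dim, hidden_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)
        self.query_proj = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats, query):
        # feats: (B, N, feat_dim); query: (B, query_dim)
        h = torch.tanh(self.feat_proj(feats) + self.query_proj(query).unsqueeze(1))
        alpha = F.softmax(self.score(h).squeeze(-1), dim=1)          # (B, N)
        attended = torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)   # (B, feat_dim)
        return attended, alpha

class TwoLevelAttention(nn.Module):
    def __init__(self, region_dim=2048, concept_dim=300, text_dim=512, hidden=512):
        super().__init__()
        # Level 1: text-guided attention selects text-related regions.
        self.text_att = GuidedAttention(region_dim, text_dim, hidden)
        # Level 2: semantic-guided attention, in both directions:
        # concepts pick out concept-related regions, and regions
        # pick out region-related concepts.
        self.region_by_concept = GuidedAttention(region_dim, concept_dim, hidden)
        self.concept_by_region = GuidedAttention(concept_dim, region_dim, hidden)
        self.fuse = nn.Linear(region_dim * 2 + concept_dim, hidden)

    def forward(self, regions, concepts, text):
        # regions:  (B, N, region_dim)   CNN region features
        # concepts: (B, K, concept_dim)  embeddings of detected semantic concepts
        # text:     (B, text_dim)        encoding of the question / partial caption
        text_regions, _ = self.text_att(regions, text)
        concept_query = concepts.mean(dim=1)
        sem_regions, _ = self.region_by_concept(regions, concept_query)
        sem_concepts, _ = self.concept_by_region(concepts, text_regions)
        fused = torch.cat([text_regions, sem_regions, sem_concepts], dim=-1)
        return torch.tanh(self.fuse(fused))  # fed to a caption/answer decoder
```

In this reading, the fused vector would feed the downstream decoder that generates the caption or answer; how the two attention levels are actually combined in the paper may differ from this simple concatenation.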