IEEE Trans Pattern Anal Mach Intell. 2023 Jul;45(7):9169-9185. doi: 10.1109/TPAMI.2023.3234170. Epub 2023 Jun 5.
Representation learning is the foundation of natural language processing (NLP). This work presents new methods that employ visual information as auxiliary signals for general NLP tasks. For each sentence, we first retrieve a flexible number of images, either from a lightweight topic-image lookup table built over existing sentence-image pairs or from a shared cross-modal embedding space pre-trained on off-the-shelf text-image pairs. The text and images are then encoded by a Transformer encoder and a convolutional neural network, respectively, and the two sequences of representations are fused by an attention layer that models the interaction between the two modalities. In this design, the retrieval process is controllable and flexible, and the universal visual representation overcomes the lack of large-scale bilingual sentence-image pairs, so our method can be readily applied to text-only tasks without manually annotated multimodal parallel corpora. We apply the proposed method to a wide range of natural language generation and understanding tasks, including neural machine translation, natural language inference, and semantic similarity. Experimental results show that our method is generally effective across tasks and languages. Analysis indicates that the visual signals enrich the textual representations of content words, provide fine-grained grounding information about the relationships between concepts and events, and can aid disambiguation.
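The fusion step described above (text tokens attending over retrieved image features) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dimensions, the single-head dot-product attention, and the residual combination are all assumptions for the sake of a runnable example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_text_image(text_repr, image_repr):
    """Cross-modal attention: each text token attends over the image features.

    text_repr:  (n_tokens, d)  e.g. Transformer encoder output
    image_repr: (n_images, d)  e.g. CNN features projected to dimension d
    """
    d = text_repr.shape[-1]
    scores = text_repr @ image_repr.T / np.sqrt(d)  # (n_tokens, n_images)
    attn = softmax(scores, axis=-1)                 # attention over images
    visual_context = attn @ image_repr              # (n_tokens, d)
    return text_repr + visual_context               # residual fusion (assumed)

rng = np.random.default_rng(0)
text_repr = rng.normal(size=(6, 8))   # 6 tokens, hidden dimension 8
image_repr = rng.normal(size=(3, 8))  # 3 retrieved images
fused = fuse_text_image(text_repr, image_repr)
print(fused.shape)  # (6, 8)
```

The fused sequence keeps the text length, so a downstream decoder or classifier can consume it exactly as it would the original text-only representations.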