Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin 541004, China.
Neural Netw. 2022 Apr;148:129-141. doi: 10.1016/j.neunet.2022.01.011. Epub 2022 Jan 21.
Transformer-based architectures have shown great success in image captioning, where the self-attention module can model source and target interactions (e.g., object-to-object, object-to-word, word-to-word). However, global information, which is essential for understanding the scene content, is not explicitly considered in the attention weight calculation. In this paper, we propose the Dual Global Enhanced Transformer (DGET) to incorporate global information in both the encoding and decoding stages. Concretely, in DGET, we regard the grid features as the visual global information and adaptively fuse them into the region features in each layer through a novel Global Enhanced Encoder (GEE). During decoding, we propose a Global Enhanced Decoder (GED) to explicitly exploit the textual global information. First, we devise a context encoder that encodes an existing caption, generated by a classic captioner, into a context vector. Then, we use this context vector to guide the decoder to generate accurate words at each time step. To validate our model, we conduct extensive experiments on the MS COCO image captioning dataset and achieve superior performance compared with many state-of-the-art methods.
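The abstract does not give the exact form of the encoder-side fusion, so the following is only a minimal, hypothetical PyTorch-style sketch of the general idea: a pooled grid (global) feature is adaptively injected into each region feature through a learned gate. The module name, layer sizes, and the sigmoid-gating form are assumptions for illustration, not the authors' actual GEE design.

```python
import torch
import torch.nn as nn


class GlobalEnhancedFusion(nn.Module):
    """Sketch of adaptive global-to-region fusion (hypothetical, not the paper's exact GEE)."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        # Gate computed from the concatenation of a region feature and the global feature.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, regions: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
        # regions: (B, N, d) region features; grid: (B, M, d) grid features.
        g = grid.mean(dim=1, keepdim=True)        # pooled visual global information, (B, 1, d)
        g = g.expand(-1, regions.size(1), -1)     # broadcast the global feature to every region
        alpha = torch.sigmoid(self.gate(torch.cat([regions, g], dim=-1)))
        return regions + alpha * g                # regions adaptively enhanced by global context


if __name__ == "__main__":
    fuse = GlobalEnhancedFusion(d_model=512)
    regions = torch.randn(2, 36, 512)   # e.g., 36 detected object regions
    grid = torch.randn(2, 49, 512)      # e.g., 7x7 grid features
    print(fuse(regions, grid).shape)    # torch.Size([2, 36, 512])
```

On the decoder side, the abstract describes an analogous use of textual global information: a context vector summarizing a previously generated caption conditions word prediction at each step; the same gating idea could be applied there as well.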