

Multilevel Attention Networks and Policy Reinforcement Learning for Image Caption Generation

Authors

Zhou Zhibo, Zhang Xiaoming, Li Zhoujun, Huang Feiran, Xu Jie

Affiliations

School of Computer Science and Engineering, Beihang University, Beijing, China.

School of Cyber Science and Technology, Beihang University, Beijing, China.

Publication

Big Data. 2022 Dec;10(6):481-492. doi: 10.1089/big.2021.0049. Epub 2021 Nov 2.

DOI: 10.1089/big.2021.0049
PMID: 34726529
Abstract

The analysis of large-scale multimodal data has become very popular recently. Image captioning, whose goal is to describe the content of image with natural language automatically, is an essential and challenging task in artificial intelligence. Commonly, most existing image caption methods utilize the mixture of Convolutional Neural Network and Recurrent Neural Network framework. These methods either pay attention to global representation at the image level or only focus on the specific concepts, such as regions and objects. To make the most of characteristics about a given image, in this study, we present a novel model named Multilevel Attention Networks and Policy Reinforcement Learning for image caption generation. Specifically, our model is composed of a multilevel attention network module and a policy reinforcement learning module. In the multilevel attention network, the object-attention network aims to capture global and local details about objects, whereas the region-attention network obtains global and local features about regions. After that, a policy reinforcement learning algorithm is adopted to overcome the exposure bias problem in the training phase and solve the loss-evaluation mismatching problem at the caption generation stage. With the attention network and policy algorithm, our model can automatically generate accurate and natural sentences for any particular image. We carry out extensive experiments on the MSCOCO and Flickr30k data sets, demonstrating that our model is superior to other competitive methods.
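The two ingredients the abstract describes — soft attention that pools region/object features for the decoder, and a policy-gradient update whose baseline addresses exposure bias and the loss-evaluation mismatch — can be sketched compactly. The following NumPy sketch is illustrative only, not the authors' implementation: all function names are hypothetical, and the self-critical baseline (sampled reward minus greedy-decoding reward, as in SCST-style training) is one common instantiation of the policy-reinforcement idea.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, features):
    """Soft attention: score each feature vector (a region or object)
    against the decoder query, normalize, and pool into one context vector."""
    scores = features @ query        # (N,) similarity of each feature to query
    weights = softmax(scores)        # (N,) attention distribution, sums to 1
    context = weights @ features     # (D,) weighted average of features
    return context, weights

def scst_advantage(sampled_reward, greedy_reward):
    """Self-critical advantage: how much better the sampled caption scored
    (e.g. by CIDEr) than the greedy test-time caption. Scaling the log-prob
    gradient by this quantity trains on the sequence-level metric directly."""
    return sampled_reward - greedy_reward

# Toy example: 4 region features of dimension 3.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 3))
q = rng.standard_normal(3)
ctx, w = attend(q, feats)
```

Because the reward enters only through a scalar advantage, the same update rule applies whether the reward is CIDEr, BLEU, or any other non-differentiable caption metric.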


Similar Articles

1. Multilevel Attention Networks and Policy Reinforcement Learning for Image Caption Generation. Big Data. 2022 Dec;10(6):481-492. doi: 10.1089/big.2021.0049. Epub 2021 Nov 2.
2. Chinese Image Caption Generation via Visual Attention and Topic Modeling. IEEE Trans Cybern. 2022 Feb;52(2):1247-1257. doi: 10.1109/TCYB.2020.2997034. Epub 2022 Feb 16.
3. Dual Global Enhanced Transformer for image captioning. Neural Netw. 2022 Apr;148:129-141. doi: 10.1016/j.neunet.2022.01.011. Epub 2022 Jan 21.
4. Translating medical image to radiological report: Adaptive multilevel multi-attention approach. Comput Methods Programs Biomed. 2022 Jun;221:106853. doi: 10.1016/j.cmpb.2022.106853. Epub 2022 May 4.
5. A Multilevel Transfer Learning Technique and LSTM Framework for Generating Medical Captions for Limited CT and DBT Images. J Digit Imaging. 2022 Jun;35(3):564-580. doi: 10.1007/s10278-021-00567-7. Epub 2022 Feb 25.
6. Insights into Object Semantics: Leveraging Transformer Networks for Advanced Image Captioning. Sensors (Basel). 2024 Mar 11;24(6):1796. doi: 10.3390/s24061796.
7. An image caption model based on attention mechanism and deep reinforcement learning. Front Neurosci. 2023 Oct 5;17:1270850. doi: 10.3389/fnins.2023.1270850. eCollection 2023.
8. Adaptive Semantic-Enhanced Transformer for Image Captioning. IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):1785-1796. doi: 10.1109/TNNLS.2022.3185320. Epub 2024 Feb 5.
9. More is Better: Precise and Detailed Image Captioning Using Online Positive Recall and Missing Concepts Mining. IEEE Trans Image Process. 2019 Jan;28(1):32-44. doi: 10.1109/TIP.2018.2855415. Epub 2018 Jul 12.
10. Image Captioning with Bidirectional Semantic Attention-Based Guiding of Long Short-Term Memory. Neural Process Lett. 2019 Aug;50(1):103-119. doi: 10.1007/s11063-018-09973-5. Epub 2019 Jan 11.