School of Computer and Communication, Lanzhou University of Technology, Lanzhou, China.
Department of Mathematics and Computer Science, Fort Valley State University, Fort Valley, GA, United States of America.
PLoS One. 2022 Aug 29;17(8):e0271322. doi: 10.1371/journal.pone.0271322. eCollection 2022.
Most image content modelling methods are designed for English descriptions, whose syntactic structure differs from that of Chinese. The few existing Chinese image description models do not fully integrate the global and local features of an image, limiting their ability to represent image details. In this paper, an encoder-decoder architecture based on the fusion of global and local features is used to describe Chinese image content. In the encoding stage, the global and local features of the image are extracted by a Convolutional Neural Network (CNN) and a target detection network, respectively, and fed to the feature fusion module. In the decoding stage, an image feature attention mechanism computes the weights of word vectors, and a new gating mechanism is added to the traditional Long Short-Term Memory (LSTM) network to emphasize the fused image features and the corresponding word vectors. In the description generation stage, the beam search algorithm is used to optimize the word vector generation process. These three stages strengthen the integration of the global and local features of the image, allowing the model to fully understand its details. The experimental results show that the model improves the quality of Chinese descriptions of image content: compared with the baseline model, the CIDEr score improves by 20.07%, and the other evaluation indices also improve significantly.
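In the description generation stage, beam search keeps the top-k partial captions at each decoding step instead of greedily emitting the single most probable word, so a sequence that starts with a lower-probability word can still win overall. The sketch below is a minimal illustration of that idea only; the `toy_model` scorer, its token ids, and all probabilities are invented for the example and are not the paper's caption model.

```python
import math

def beam_search(step_logprobs, beam_width=3, max_len=4, eos=0):
    """Beam search over a next-token log-probability function.

    step_logprobs(prefix) -> {token_id: log_prob} for the next token.
    Returns the highest-scoring sequence found (finished with `eos`,
    or truncated at max_len).
    """
    beams = [([], 0.0)]          # (token sequence, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in step_logprobs(seq).items():
                if tok == eos:
                    completed.append((seq + [tok], score + lp))
                else:
                    candidates.append((seq + [tok], score + lp))
        if not candidates:
            break
        # keep only the top-k partial hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    completed.extend(beams)      # fall back to unfinished beams if needed
    return max(completed, key=lambda c: c[1])[0]

def toy_model(prefix):
    # Illustrative distribution: token 1 looks best locally, but the
    # sequence [2, eos] has the higher total probability (0.4 * 0.9).
    if not prefix:
        return {1: math.log(0.6), 2: math.log(0.4)}
    if prefix[-1] == 2:
        return {0: math.log(0.9), 1: math.log(0.1)}
    if len(prefix) == 1:
        return {0: math.log(0.2), 1: math.log(0.8)}
    if len(prefix) == 2:
        return {0: math.log(0.3), 1: math.log(0.7)}
    return {0: math.log(0.5), 1: math.log(0.5)}
```

With `beam_width=2` the search returns `[2, 0]`, whereas a width of 1 (greedy decoding) commits to token 1 at the first step and ends up with a lower-probability caption, which is the behaviour the beam search stage is meant to avoid.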