School of Computer Science and Technology, Changchun University of Science and Technology, Changchun, China.
Comput Intell Neurosci. 2022 Jan 5;2022:9743123. doi: 10.1155/2022/9743123. eCollection 2022.
Recent image captioning models based on the encoder-decoder framework have achieved remarkable success in generating human-like sentences. However, the explicit separation between the encoder and the decoder disconnects the image from the sentence, which usually leads to a rough image description: the generated caption covers only the main instances and unexpectedly neglects additional objects and scenes, reducing the consistency between the caption and the image. To address this issue, we propose an image captioning system with context-fused guidance in this paper. It combines regional and global image representations into compositional visual features to learn the objects and attributes in images. To integrate image-level semantic information, a visual concept is employed. To avoid misleading the decoding, a context fusion gate is introduced that computes the textual context by selectively aggregating the information of the visual concept and the word embedding. The context-fused image guidance is then formulated from the compositional visual features and the textual context, providing the decoder with informative semantic knowledge. Finally, a captioner with a two-layer LSTM architecture is constructed to generate captions. Moreover, to overcome exposure bias, we train the proposed model through sequential decision-making. Experiments conducted on the MS COCO dataset show the outstanding performance of our work, and linguistic analysis demonstrates that our model improves the consistency between captions and images.
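The abstract does not give the gate equations; as an illustration only, the "selective aggregation" of a context fusion gate can be sketched as a learned sigmoid gate that interpolates, per dimension, between the visual-concept vector and the word embedding. The gating form and weight shapes below are assumptions for illustration, not the authors' exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_fusion_gate(concept, embed, W, b):
    """Selectively aggregate a visual-concept vector and a word embedding
    into a textual context vector (illustrative sketch).

    concept, embed : (d,) vectors; W : (d, 2d) gate weights; b : (d,) bias.
    The gate g decides, per dimension, how much of the visual concept
    versus the word embedding flows into the context.
    """
    g = sigmoid(W @ np.concatenate([concept, embed]) + b)
    return g * concept + (1.0 - g) * embed

# Toy usage with random vectors in place of real model features.
rng = np.random.default_rng(0)
d = 8
concept = rng.normal(size=d)          # stand-in for the visual concept
embed = rng.normal(size=d)            # stand-in for the word embedding
W = rng.normal(scale=0.1, size=(d, 2 * d))
b = np.zeros(d)
ctx = context_fusion_gate(concept, embed, W, b)
```

Because the gate is a per-dimension convex combination, each component of the resulting context lies between the corresponding components of the two inputs, which is one simple way such a gate can suppress misleading signals from either source.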