School of Computer Science and Technology, Changchun University of Science and Technology, Changchun, China.
Comput Intell Neurosci. 2022 Jan 5;2022:9743123. doi: 10.1155/2022/9743123. eCollection 2022.
Recent image captioning models based on the encoder-decoder framework have achieved remarkable success in generating human-like sentences. However, the explicit separation between the encoder and the decoder disconnects the image from the sentence, which usually leads to a rough image description: the generated caption covers only the main instances and unexpectedly neglects additional objects and scenes, reducing the consistency between the caption and the image. To address this issue, we propose an image captioning system with context-fused guidance in this paper. It combines regional and global image representations into compositional visual features to learn the objects and attributes in images. To integrate image-level semantic information, a visual concept is employed. To avoid misleading the decoding, a context fusion gate is introduced that computes the textual context by selectively aggregating the information of the visual concept and the word embedding. The context-fused image guidance is then formulated from the compositional visual features and the textual context, providing the decoder with informative semantic knowledge. Finally, a captioner with a two-layer LSTM architecture is constructed to generate captions. Moreover, to overcome exposure bias, we train the proposed model through sequential decision-making. Experiments conducted on the MS COCO dataset show the outstanding performance of our work, and linguistic analysis demonstrates that our model improves the consistency between captions and images.
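The abstract does not give the gate equations; as an illustration only, the "selective aggregation" of a context fusion gate can be sketched as a learned sigmoid gate that interpolates, per dimension, between the visual-concept vector and the word embedding. The gating form and weight shapes below are assumptions for illustration, not the authors' exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_fusion_gate(concept, embed, W, b):
    """Selectively aggregate a visual-concept vector and a word embedding
    into a textual context vector (illustrative sketch).

    concept, embed : (d,) vectors; W : (d, 2d) gate weights; b : (d,) bias.
    The gate g decides, per dimension, how much of the visual concept
    versus the word embedding flows into the context.
    """
    g = sigmoid(W @ np.concatenate([concept, embed]) + b)
    return g * concept + (1.0 - g) * embed

# Toy usage with random vectors in place of real model features.
rng = np.random.default_rng(0)
d = 8
concept = rng.normal(size=d)          # stand-in for the visual concept
embed = rng.normal(size=d)            # stand-in for the word embedding
W = rng.normal(scale=0.1, size=(d, 2 * d))
b = np.zeros(d)
ctx = context_fusion_gate(concept, embed, W, b)
```

Because the gate is a per-dimension convex combination, each component of the resulting context lies between the corresponding components of the two inputs, which is one simple way such a gate can suppress misleading signals from either source.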