Abdal Hafeth Deema, Kollias Stefanos
School of Computer Science, University of Lincoln, Lincoln LN6 7TS, UK.
School of Electrical & Computer Engineering, National Technical University of Athens, 15780 Athens, Greece.
Sensors (Basel). 2024 Mar 11;24(6):1796. doi: 10.3390/s24061796.
Image captioning is a technique used to generate descriptive captions for images. Typically, it employs a Convolutional Neural Network (CNN) as the encoder to extract visual features, and a decoder model, often based on Recurrent Neural Networks (RNNs), to generate the captions. Recently, the self-attention mechanism has been widely adopted within the encoder-decoder architecture. However, this approach faces certain challenges that require further research. One such challenge is that the extracted visual features do not fully exploit the available image information, primarily due to the absence of semantic concepts. This limits the ability to fully comprehend the content depicted in the image. To address this issue, we present a new image-Transformer-based model enhanced with image object semantic representation. Our model incorporates semantic representation in encoder attention, enriching visual features by integrating instance-level concepts. Additionally, we employ a Transformer as the decoder in the language generation module. By doing so, we achieve improved performance in generating accurate and diverse captions. We evaluated the performance of our model on the MS-COCO and novel MACE datasets. The results show that our model is competitive with state-of-the-art approaches in caption generation.
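The sketch below is a minimal, hypothetical illustration of the kind of architecture the abstract describes: a Transformer encoder that attends jointly over CNN region features and embeddings of detected object classes (instance-level semantic concepts), followed by a Transformer decoder that generates the caption. It is not the authors' implementation; all class names, parameter names, and dimensions (e.g. SemanticImageTransformerCaptioner, feat_dim, num_classes) are assumptions for illustration only.

```python
# Hypothetical sketch (not the paper's code): semantic-augmented Transformer captioner.
import torch
import torch.nn as nn

class SemanticImageTransformerCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000,
                 num_classes=80, n_heads=8, n_layers=3, max_len=20):
        super().__init__()
        self.visual_proj = nn.Linear(feat_dim, d_model)          # project CNN region features
        self.semantic_emb = nn.Embedding(num_classes, d_model)   # embed detected object class IDs
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, object_ids, caption_ids):
        # region_feats: (B, R, feat_dim) visual features from a CNN / object detector
        # object_ids:   (B, O) class indices of detected instances (semantic concepts)
        # caption_ids:  (B, T) caption tokens for teacher-forced training
        vis = self.visual_proj(region_feats)
        sem = self.semantic_emb(object_ids)
        # Encoder self-attention runs over the concatenated visual + semantic tokens,
        # so visual features are enriched with instance-level concepts.
        memory = self.encoder(torch.cat([vis, sem], dim=1))
        T = caption_ids.size(1)
        pos = torch.arange(T, device=caption_ids.device)
        tgt = self.word_emb(caption_ids) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(caption_ids.device)
        out = self.decoder(tgt, memory, tgt_mask=mask)           # Transformer language decoder
        return self.lm_head(out)                                 # (B, T, vocab_size) logits
```

In this reading, the key design choice is that semantic concepts enter as extra encoder tokens rather than being fused after decoding, so every decoder cross-attention step can weigh both appearance features and object identities.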