From Show to Tell: A Survey on Deep Learning-Based Image Captioning

Authors

Stefanini Matteo, Cornia Marcella, Baraldi Lorenzo, Cascianelli Silvia, Fiameni Giuseppe, Cucchiara Rita

Publication information

IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):539-559. doi: 10.1109/TPAMI.2022.3148210. Epub 2022 Dec 5.

Abstract

Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. Starting from 2015 the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions, attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, regardless of the impressive results, research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting the future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy.

Similar articles

From vision to text: A comprehensive review of natural image captioning in medical diagnosis and radiology report generation.

Med Image Anal. 2024 Oct;97:103264. doi: 10.1016/j.media.2024.103264. Epub 2024 Jul 8.

Arabic Captioning for Images of Clothing Using Deep Learning.

Sensors (Basel). 2023 Apr 7;23(8):3783. doi: 10.3390/s23083783.

Chinese Image Caption Generation via Visual Attention and Topic Modeling.

IEEE Trans Cybern. 2022 Feb;52(2):1247-1257. doi: 10.1109/TCYB.2020.2997034. Epub 2022 Feb 16.

Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training.

IEEE J Biomed Health Inform. 2022 Dec;26(12):6070-6080. doi: 10.1109/JBHI.2022.3207502. Epub 2022 Dec 7.

Context-Aware Visual Policy Network for Fine-Grained Image Captioning.

IEEE Trans Pattern Anal Mach Intell. 2022 Feb;44(2):710-722. doi: 10.1109/TPAMI.2019.2909864. Epub 2022 Jan 7.

Cross-Modal self-supervised vision language pre-training with multiple objectives for medical visual question answering.

J Biomed Inform. 2024 Dec;160:104748. doi: 10.1016/j.jbi.2024.104748. Epub 2024 Nov 12.

Image Captioning Based on Semantic Scenes.

Entropy (Basel). 2024 Oct 18;26(10):876. doi: 10.3390/e26100876.

A Survey on Learning Objects' Relationship for Image Captioning.

Comput Intell Neurosci. 2023 May 29;2023:8600853. doi: 10.1155/2023/8600853. eCollection 2023.

Mining core information by evaluating semantic importance for unpaired image captioning.

Neural Netw. 2024 Nov;179:106519. doi: 10.1016/j.neunet.2024.106519. Epub 2024 Jul 9.

Cited by

UICD: A new dataset and approach for urdu image captioning.

PLoS One. 2025 Jun 2;20(6):e0320701. doi: 10.1371/journal.pone.0320701. eCollection 2025.

Adaptafood: an intelligent system to adapt recipes to specialised diets and healthy lifestyles.

Multimed Syst. 2025;31(1):87. doi: 10.1007/s00530-025-01667-y. Epub 2025 Feb 1.

Novel concept-based image captioning models using LSTM and multi-encoder transformer architecture.

Sci Rep. 2024 Sep 5;14(1):20762. doi: 10.1038/s41598-024-69664-1.

FFA-GPT: an automated pipeline for fundus fluorescein angiography interpretation and question-answer.

NPJ Digit Med. 2024 May 3;7(1):111. doi: 10.1038/s41746-024-01101-z.

Improved Image Caption Rating - Datasets, Game, and Model.

Ext Abstr Hum Factors Computing Syst. 2023 Apr;2023. doi: 10.1145/3544549.3585632. Epub 2023 Apr 19.

Exploring the potential of ChatGPT as an adjunct for generating diagnosis based on chief complaint and cone beam CT radiologic findings.

BMC Med Inform Decis Mak. 2024 Feb 19;24(1):55. doi: 10.1186/s12911-024-02445-y.

Images, Words, and Imagination: Accessible Descriptions to Support Blind and Low Vision Art Exploration and Engagement.

J Imaging. 2024 Jan 18;10(1):26. doi: 10.3390/jimaging10010026.

Towards Artificial Intelligence Applications in Next Generation Cytopathology.

Biomedicines. 2023 Aug 8;11(8):2225. doi: 10.3390/biomedicines11082225.

PathNarratives: Data annotation for pathological human-AI collaborative diagnosis.

Front Med (Lausanne). 2023 Jan 26;9:1070072. doi: 10.3389/fmed.2022.1070072. eCollection 2022.

Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates.

Sensors (Basel). 2023 Jan 23;23(3):1286. doi: 10.3390/s23031286.
