Zhao Wentian, Wu Xinxiao, Luo Jiebo
IEEE Trans Image Process. 2021;30:1180-1192. doi: 10.1109/TIP.2020.3042086. Epub 2020 Dec 17.
In recent years, large-scale datasets of paired images and sentences have enabled remarkable success in automatically generating descriptions for images, i.e., image captioning. However, it is labour-intensive and time-consuming to collect a sufficient number of paired images and sentences in each domain. It is therefore appealing to transfer an image captioning model trained in an existing domain with paired images and sentences (i.e., the source domain) to a new domain with only unpaired data (i.e., the target domain). In this paper, we propose a cross-modal retrieval-aided approach to cross-domain image captioning that leverages a cross-modal retrieval model to generate pseudo pairs of images and sentences in the target domain, which facilitate the adaptation of the captioning model. To learn the correlation between images and sentences in the target domain, we propose an iterative cross-modal retrieval process: a cross-modal retrieval model is first pre-trained on the source-domain data and then applied to the target-domain data to acquire an initial set of pseudo image-sentence pairs. These pseudo pairs are further refined by alternately fine-tuning the retrieval model on them and regenerating them with the fine-tuned retrieval model. To make the linguistic patterns learned from the source-domain sentences adapt well to the target domain, we propose an adaptive image captioning model with a self-attention mechanism that is fine-tuned on the refined pseudo image-sentence pairs. Experimental results in several settings, with MSCOCO as the source domain and five different datasets (Flickr30k, TGIF, CUB-200, Oxford-102 and Conceptual) as the target domains, demonstrate that our method performs better than or comparably to state-of-the-art methods in most settings. We also extend our method to cross-domain video captioning, with MSR-VTT as the source domain and two other datasets (MSVD and Charades Captions) as the target domains, to further demonstrate the effectiveness of our method.
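To make the iterative retrieval process concrete, the following is a minimal sketch, not the authors' implementation: the encoders are random stand-ins for the pre-trained retrieval model's image and sentence branches, and the function names, feature dimension, and confidence threshold are all hypothetical. It only illustrates the loop structure described in the abstract: mine pseudo image-sentence pairs from unpaired target-domain data by cross-modal similarity, fine-tune the retrieval model on those pairs, and regenerate the pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_images(images):
    # Stand-in for the image branch of a cross-modal retrieval model
    # (e.g., a CNN encoder); here we return random 256-d features.
    return rng.normal(size=(len(images), 256))

def embed_sentences(sentences):
    # Stand-in for the sentence branch (e.g., an RNN/Transformer encoder).
    return rng.normal(size=(len(sentences), 256))

def mine_pseudo_pairs(img_feats, sent_feats, threshold=0.0):
    # For each target-domain image, retrieve the most similar unpaired
    # sentence by cosine similarity; keep pairs whose similarity exceeds
    # a (hypothetical) confidence threshold.
    img_norm = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    sent_norm = sent_feats / np.linalg.norm(sent_feats, axis=1, keepdims=True)
    sims = img_norm @ sent_norm.T  # (n_images, n_sentences)
    best = sims.argmax(axis=1)
    return [(i, int(j)) for i, (j, s) in enumerate(zip(best, sims.max(axis=1)))
            if s > threshold]

def fine_tune_retrieval(pairs):
    # Placeholder for fine-tuning the retrieval model on the pseudo pairs;
    # the paper alternates this step with regenerating the pairs.
    pass

# Unpaired target-domain data (dummy identifiers for illustration only).
images = [f"img_{i}" for i in range(100)]
sentences = [f"sent_{j}" for j in range(500)]

# Iterative refinement: retrieve pseudo pairs, fine-tune, repeat.
for it in range(3):
    pairs = mine_pseudo_pairs(embed_images(images), embed_sentences(sentences))
    fine_tune_retrieval(pairs)
    print(f"iteration {it}: kept {len(pairs)} pseudo image-sentence pairs")
```

In the paper, the refined pseudo pairs produced by this loop are then used to fine-tune the self-attention-based captioning model on the target domain; that step is omitted here.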