
Cross-Domain Image Captioning via Cross-Modal Retrieval and Model Adaptation

Authors

Zhao Wentian, Wu Xinxiao, Luo Jiebo

Publication

IEEE Trans Image Process. 2021;30:1180-1192. doi: 10.1109/TIP.2020.3042086. Epub 2020 Dec 17.

DOI: 10.1109/TIP.2020.3042086
PMID: 33306468
Abstract

In recent years, large-scale datasets of paired images and sentences have enabled remarkable success in automatically generating descriptions for images, namely image captioning. However, it is labour-intensive and time-consuming to collect a sufficient number of paired images and sentences in each domain. It may be beneficial to transfer an image captioning model trained in an existing domain with pairs of images and sentences (i.e., the source domain) to a new domain with only unpaired data (i.e., the target domain). In this paper, we propose a cross-modal retrieval aided approach to cross-domain image captioning that leverages a cross-modal retrieval model to generate pseudo pairs of images and sentences in the target domain to facilitate the adaptation of the captioning model. To learn the correlation between images and sentences in the target domain, we propose an iterative cross-modal retrieval process in which a cross-modal retrieval model is first pre-trained using the source domain data and then applied to the target domain data to acquire an initial set of pseudo image-sentence pairs. The pseudo image-sentence pairs are further refined by iteratively fine-tuning the retrieval model with the pseudo image-sentence pairs and updating the pseudo image-sentence pairs using the retrieval model. To make the linguistic patterns of the sentences learned in the source domain adapt well to the target domain, we propose an adaptive image captioning model with a self-attention mechanism fine-tuned using the refined pseudo image-sentence pairs. Experimental results on several settings where MSCOCO is used as the source domain and five different datasets (Flickr30k, TGIF, CUB-200, Oxford-102 and Conceptual) are used as the target domains demonstrate that our method achieves mostly better or comparable performance against the state-of-the-art methods. We also extend our method to cross-domain video captioning, where MSR-VTT is used as the source domain and two other datasets (MSVD and Charades Captions) are used as the target domains, to further demonstrate the effectiveness of our method.
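The iterative retrieval process described above alternates between retrieving pseudo image-sentence pairs with the current retrieval model and fine-tuning that model on the retrieved pairs. A minimal sketch of this loop, assuming cosine-similarity retrieval over precomputed embeddings (the function names and the pluggable `update_model` callback are hypothetical, not from the paper):

```python
import numpy as np

def retrieve_pseudo_pairs(img_emb, sent_emb):
    """For each target-domain image, retrieve the most similar sentence
    (by cosine similarity) to form a pseudo image-sentence pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    sent = sent_emb / np.linalg.norm(sent_emb, axis=1, keepdims=True)
    sims = img @ sent.T                     # (n_images, n_sentences)
    return np.argmax(sims, axis=1)          # best sentence index per image

def iterative_refinement(img_emb, sent_emb, update_model, n_iters=3):
    """Alternate between (a) retrieving pseudo pairs with the current
    retrieval model and (b) fine-tuning the model on those pairs.
    `update_model` stands in for one fine-tuning step and returns the
    updated image and sentence embeddings."""
    pairs = retrieve_pseudo_pairs(img_emb, sent_emb)
    for _ in range(n_iters):
        img_emb, sent_emb = update_model(img_emb, sent_emb, pairs)
        pairs = retrieve_pseudo_pairs(img_emb, sent_emb)
    return pairs
```

In the paper, the retrieval model itself (not raw embeddings) is fine-tuned, and the refined pairs are then used to adapt the self-attention captioning model; the sketch only illustrates the alternating retrieve/update structure.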


Similar Articles

1
Cross-Domain Image Captioning via Cross-Modal Retrieval and Model Adaptation.
IEEE Trans Image Process. 2021;30:1180-1192. doi: 10.1109/TIP.2020.3042086. Epub 2020 Dec 17.
2
Exploiting Cross-Modal Prediction and Relation Consistency for Semisupervised Image Captioning.
IEEE Trans Cybern. 2024 Feb;54(2):890-902. doi: 10.1109/TCYB.2022.3156367. Epub 2024 Jan 17.
3
Topic-Oriented Image Captioning Based on Order-Embedding.
IEEE Trans Image Process. 2019 Jun;28(6):2743-2754. doi: 10.1109/TIP.2018.2889922. Epub 2018 Dec 27.
4
Aligning Source Visual and Target Language Domains for Unpaired Video Captioning.
IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):9255-9268. doi: 10.1109/TPAMI.2021.3132229. Epub 2022 Nov 7.
5
Image-Text Surgery: Efficient Concept Learning in Image Captioning by Generating Pseudopairs.
IEEE Trans Neural Netw Learn Syst. 2018 Dec;29(12):5910-5921. doi: 10.1109/TNNLS.2018.2813306. Epub 2018 Apr 5.
6
Discriminative Style Learning for Cross-Domain Image Captioning.
IEEE Trans Image Process. 2022;31:1723-1736. doi: 10.1109/TIP.2022.3145158. Epub 2022 Feb 8.
7
Deep Relation Embedding for Cross-Modal Retrieval.
IEEE Trans Image Process. 2021;30:617-627. doi: 10.1109/TIP.2020.3038354. Epub 2020 Dec 1.
8
Deep Visual-Semantic Alignments for Generating Image Descriptions.
IEEE Trans Pattern Anal Mach Intell. 2017 Apr;39(4):664-676. doi: 10.1109/TPAMI.2016.2598339. Epub 2016 Aug 5.
9
An Ensemble of Generation- and Retrieval-based Image Captioning with Dual Generator Generative Adversarial Network.
IEEE Trans Image Process. 2020 Oct 15;PP. doi: 10.1109/TIP.2020.3028651.
10
Towards Generating and Evaluating Iconographic Image Captions of Artworks.
J Imaging. 2021 Jul 23;7(8):123. doi: 10.3390/jimaging7080123.

Cited By

1
Multi-Modal Fake News Detection via Bridging the Gap between Modals.
Entropy (Basel). 2023 Apr 4;25(4):614. doi: 10.3390/e25040614.
2
Research on image content description in Chinese based on fusion of image global and local features.
PLoS One. 2022 Aug 29;17(8):e0271322. doi: 10.1371/journal.pone.0271322. eCollection 2022.