Zhao Wentian, Wu Xinxiao, Luo Jiebo
IEEE Trans Image Process. 2021;30:1180-1192. doi: 10.1109/TIP.2020.3042086. Epub 2020 Dec 17.
In recent years, large-scale datasets of paired images and sentences have enabled remarkable success in automatically generating descriptions for images, i.e., image captioning. However, it is labour-intensive and time-consuming to collect a sufficient number of paired images and sentences in each domain. It is therefore appealing to transfer an image captioning model trained in an existing domain with paired images and sentences (i.e., the source domain) to a new domain with only unpaired data (i.e., the target domain). In this paper, we propose a cross-modal retrieval-aided approach to cross-domain image captioning that leverages a cross-modal retrieval model to generate pseudo pairs of images and sentences in the target domain, which facilitate the adaptation of the captioning model. To learn the correlation between images and sentences in the target domain, we propose an iterative cross-modal retrieval process: a cross-modal retrieval model is first pre-trained on the source-domain data and then applied to the target-domain data to acquire an initial set of pseudo image-sentence pairs. These pseudo pairs are further refined by alternately fine-tuning the retrieval model on them and regenerating them with the fine-tuned retrieval model. To make the linguistic patterns learned from the source-domain sentences adapt well to the target domain, we propose an adaptive image captioning model with a self-attention mechanism that is fine-tuned on the refined pseudo image-sentence pairs. Experimental results in several settings, with MSCOCO as the source domain and five different datasets (Flickr30k, TGIF, CUB-200, Oxford-102 and Conceptual) as the target domains, demonstrate that our method performs better than or comparably to state-of-the-art methods in most settings. We also extend our method to cross-domain video captioning, with MSR-VTT as the source domain and two other datasets (MSVD and Charades Captions) as the target domains, to further demonstrate the effectiveness of our method.
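To make the iterative retrieval process concrete, the following is a minimal sketch, not the authors' implementation: the encoders are random stand-ins for the pre-trained retrieval model's image and sentence branches, and the function names, feature dimension, and confidence threshold are all hypothetical. It only illustrates the loop structure described in the abstract: mine pseudo image-sentence pairs from unpaired target-domain data by cross-modal similarity, fine-tune the retrieval model on those pairs, and regenerate the pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_images(images):
    # Stand-in for the image branch of a cross-modal retrieval model
    # (e.g., a CNN encoder); here we return random 256-d features.
    return rng.normal(size=(len(images), 256))

def embed_sentences(sentences):
    # Stand-in for the sentence branch (e.g., an RNN/Transformer encoder).
    return rng.normal(size=(len(sentences), 256))

def mine_pseudo_pairs(img_feats, sent_feats, threshold=0.0):
    # For each target-domain image, retrieve the most similar unpaired
    # sentence by cosine similarity; keep pairs whose similarity exceeds
    # a (hypothetical) confidence threshold.
    img_norm = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    sent_norm = sent_feats / np.linalg.norm(sent_feats, axis=1, keepdims=True)
    sims = img_norm @ sent_norm.T  # (n_images, n_sentences)
    best = sims.argmax(axis=1)
    return [(i, int(j)) for i, (j, s) in enumerate(zip(best, sims.max(axis=1)))
            if s > threshold]

def fine_tune_retrieval(pairs):
    # Placeholder for fine-tuning the retrieval model on the pseudo pairs;
    # the paper alternates this step with regenerating the pairs.
    pass

# Unpaired target-domain data (dummy identifiers for illustration only).
images = [f"img_{i}" for i in range(100)]
sentences = [f"sent_{j}" for j in range(500)]

# Iterative refinement: retrieve pseudo pairs, fine-tune, repeat.
for it in range(3):
    pairs = mine_pseudo_pairs(embed_images(images), embed_sentences(sentences))
    fine_tune_retrieval(pairs)
    print(f"iteration {it}: kept {len(pairs)} pseudo image-sentence pairs")
```

In the paper, the refined pseudo pairs produced by this loop are then used to fine-tune the self-attention-based captioning model on the target domain; that step is omitted here.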