Jang Young Kyun, Kim Donghyun
Meta Platforms (United States), Menlo Park, USA.
Korea University, Seoul, Republic of Korea.
Sci Rep. 2025 Jul 28;15(1):27463. doi: 10.1038/s41598-025-07798-6.
Composed Image Retrieval (CIR) retrieves a target image similar to a reference image, guided by a provided textual modification (i.e., a triplet of <reference image, text, target image>). Previous works on CIR can largely be divided into two categories: supervised learning approaches and weakly supervised (i.e., zero-shot) learning approaches. Supervised CIR models require labeled triplets, which are not easily obtained and which limit the widespread use and scalability of CIR. On the other hand, a weakly supervised learning approach (also called zero-shot CIR) can be trained relatively easily with image-caption pairs, without considering image-to-image relations (i.e., no supervised triplets required), but it tends to yield lower accuracy. In this paper, we extend the application of existing CIR into semi-supervised learning, domain adaptation, and test-time adaptation contexts by exploiting only unlabeled image data. Previous approaches cannot be directly applied to these settings, as it is not trivial to leverage fully unlabeled data for CIR. To address this, we propose a new approach and settings in which we identify a reference image and its associated target images in auxiliary image data. Our method trains a large language model-based Visual Delta Generator (VDG) to produce textual descriptions of the visual differences (i.e., visual deltas) between these images. The VDG, equipped with fluent language knowledge and being model-agnostic, can generate pseudo-triplets to boost the performance of CIR models in diverse settings, including semi-supervised CIR, domain adaptation, and test-time adaptation. Our approach not only significantly improves existing supervised learning approaches and achieves state-of-the-art results on CIR benchmarks, but also expands the application of CIR across diverse settings.
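The abstract's pipeline (pair visually similar unlabeled images as reference/target, then caption their difference to form pseudo-triplets) can be sketched minimally as below. This is an illustrative assumption-laden toy, not the paper's implementation: images are stood in for by precomputed embedding vectors, pairing is nearest-neighbor by cosine similarity, and `visual_delta_generator` is a stub for the LLM-based VDG, which in the actual method would describe the visual delta between the two images.

```python
import numpy as np

def mine_pairs(features, threshold=0.6):
    """Find (reference, target) index pairs among unlabeled images whose
    embeddings are similar enough to plausibly share visual content.
    `features` is an (n_images, dim) array of image embeddings."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T           # pairwise cosine similarities
    np.fill_diagonal(sim, -1.0)       # exclude self-pairs
    pairs = []
    for i in range(len(features)):
        j = int(np.argmax(sim[i]))    # nearest neighbor of image i
        if sim[i, j] >= threshold:
            pairs.append((i, j))
    return pairs

def visual_delta_generator(ref_id, tgt_id):
    # Placeholder for the LLM-based VDG: the real model would generate a
    # textual description of the visual difference between the two images.
    return f"change image {ref_id} towards image {tgt_id}"

def build_pseudo_triplets(features):
    """Assemble <reference, modification text, target> pseudo-triplets."""
    return [(i, visual_delta_generator(i, j), j)
            for i, j in mine_pairs(features)]

# Toy usage: two clusters of similar embeddings yield within-cluster pairs.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
triplets = build_pseudo_triplets(feats)
```

The resulting pseudo-triplets can then be used as supervision to train or adapt a CIR model, which is how the approach extends to the semi-supervised, domain-adaptation, and test-time-adaptation settings without labeled triplets.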