Zhao Cairong, Wang Yubin, Jiang Xinyang, Shen Yifei, Song Kaitao, Li Dongsheng, Miao Duoqian
IEEE Trans Image Process. 2024;33:1348-1360. doi: 10.1109/TIP.2024.3362062. Epub 2024 Feb 14.
Prompt learning stands out as one of the most efficient approaches for adapting powerful vision-language foundation models such as CLIP to downstream datasets, tuning learnable prompt vectors with very few samples. However, despite its success in achieving remarkable performance on in-domain data, prompt learning still faces the significant challenge of generalizing effectively to novel classes and domains. Some existing methods address this concern by dynamically generating distinct prompts for different domains, yet they overlook the inherent potential of prompts to generalize across unseen domains. To address these limitations, our study introduces an innovative prompt learning paradigm, called MetaPrompt, that directly learns domain-invariant prompts in few-shot scenarios. To facilitate learning prompts for image and text inputs independently, we present a dual-modality prompt tuning network comprising two pairs of coupled encoders. Our study centers on an alternating episodic training algorithm to enrich the generalization capacity of the learned prompts. In contrast to traditional episodic training algorithms, our approach incorporates both in-domain updates and domain-split updates in a batch-wise manner. For in-domain updates, we introduce a novel asymmetric contrastive learning paradigm in which representations from the pre-trained encoder serve as supervision to regularize prompts from the prompted encoder. To enhance performance on out-of-domain distributions, we propose a domain-split optimization during domain-split updates, applied to visual prompts for cross-domain tasks or to textual prompts for cross-class tasks. Extensive experiments across 11 datasets for base-to-new generalization and 4 datasets for domain generalization demonstrate favorable performance.
Compared with the state-of-the-art method, MetaPrompt achieves an absolute gain of 1.02% on the overall harmonic mean in base-to-new generalization and consistently demonstrates superiority over all benchmarks in domain generalization.
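The asymmetric contrastive objective used in the in-domain updates can be sketched as follows. This is a minimal illustration assuming a standard InfoNCE-style formulation in which features from the frozen pre-trained encoder act as fixed supervision targets for the prompted encoder; the function name, temperature value, and feature shapes are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

def asymmetric_contrastive_loss(prompted, frozen, temperature=0.07):
    """Hypothetical sketch of the in-domain asymmetric contrastive update.

    `prompted`: features from the prompted (tunable) encoder, shape (N, D).
    `frozen`:   features from the frozen pre-trained encoder, shape (N, D),
                treated as fixed supervision (no gradient would flow here).
    """
    # L2-normalize each feature row, as is standard for CLIP-style features.
    p = prompted / np.linalg.norm(prompted, axis=1, keepdims=True)
    f = frozen / np.linalg.norm(frozen, axis=1, keepdims=True)

    # Similarity logits between prompted and frozen views of the batch.
    logits = p @ f.T / temperature

    # Cross-entropy with matching (diagonal) pairs as positives: the frozen
    # representation of sample i supervises the prompted representation of i.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(p))
    return -log_probs[idx, idx].mean()
```

In a full training loop this loss would be minimized only with respect to the prompt parameters feeding the prompted encoder, so the pre-trained representations anchor the prompts without being updated themselves.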