Synth-CLIP: Synthetic data make CLIP generalize better in data-limited scenarios.

Authors

Liu Mushui, He Weijie, Lu Ziqian, Dan Jun, Yu Yunlong, Li Yingming, Li Xi, Han Jungong

Affiliations

College of Information Science and Electronic Engineering, Zhejiang University, China.

School of Aeronautics and Astronautics, Zhejiang University, China.

Publication

Neural Netw. 2025 Apr;184:107083. doi: 10.1016/j.neunet.2024.107083. Epub 2024 Dec 30.

Abstract

Prompt learning is a powerful technique that enables the transfer of Vision-Language Models (VLMs) like CLIP to downstream tasks. However, when the prompt-based methods are fine-tuned solely on base classes, they often struggle to generalize to novel classes lacking visual samples during training, especially in scenarios with limited training data. To address this challenge, we propose an innovative approach called Synth-CLIP that leverages synthetic data to enhance CLIP's generalization capability for base classes and the general capability for novel classes. Synth-CLIP fine-tunes the pre-trained CLIP model by seamlessly integrating tailored prompts that are both domain-specific and domain-shared, specifically designed for visual samples, reorganizing visual features from real and synthetic domains into the semantic space. This approach efficiently expands the data pool and enriches category diversity. Moreover, based on semantic structure consistency, we introduce a cross-domain feature alignment loss to match the real and synthetic samples in the feature embedding space. By aligning the visual and semantic distributions, the synthetic data from base and novel classes provide crucial discriminative information, enabling the model to rebalance the decision boundaries even in the absence of real novel visual samples. Experimental results on three model generalization tasks demonstrate that our method performs very competitively across various benchmarks. Notably, Synth-CLIP outperforms the recent competitor PromptSRC by an average improvement of 3.0% on novel classes across 11 datasets in open-vocabulary scenarios.
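The abstract names two concrete ingredients: learnable prompts that are partly domain-shared and partly domain-specific (real vs. synthetic images), and a cross-domain feature alignment loss built on semantic structure consistency. The sketch below is a minimal, illustrative PyTorch reading of those two ideas; the toy text encoder, tensor shapes, the real/synthetic domain indexing, and the exact form of the alignment loss are assumptions made for this example, not the authors' implementation.

```python
# Illustrative sketch only: domain-shared + domain-specific learnable prompt
# vectors prepended to class-name token embeddings, plus a toy cross-domain
# alignment loss. Shapes, names, and the loss form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyTextEncoder(nn.Module):
    """Stand-in for a CLIP-style text encoder: mean-pools token embeddings
    and projects them to the joint embedding space."""
    def __init__(self, ctx_dim=512, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(ctx_dim, embed_dim)

    def forward(self, token_embeddings):          # (C, n_tokens, ctx_dim)
        return self.proj(token_embeddings.mean(dim=1))   # (C, embed_dim)


class DomainPrompts(nn.Module):
    """Learnable context vectors: one set shared across domains and one set
    per domain (0 = real, 1 = synthetic, an assumed convention), prepended to
    class-name token embeddings before the text encoder."""
    def __init__(self, text_encoder, class_tokens, n_ctx=4, ctx_dim=512):
        super().__init__()
        self.text_encoder = text_encoder
        self.register_buffer("class_tokens", class_tokens)        # (C, T, D), assumed given
        self.shared_ctx = nn.Parameter(0.02 * torch.randn(n_ctx, ctx_dim))
        self.domain_ctx = nn.Parameter(0.02 * torch.randn(2, n_ctx, ctx_dim))

    def forward(self, domain: int):
        ctx = (self.shared_ctx + self.domain_ctx[domain]).unsqueeze(0)
        ctx = ctx.expand(self.class_tokens.size(0), -1, -1)       # one copy per class
        prompts = torch.cat([ctx, self.class_tokens], dim=1)      # (C, n_ctx + T, D)
        return F.normalize(self.text_encoder(prompts), dim=-1)    # (C, embed_dim)


def structure_alignment_loss(real_feats, synth_feats):
    """Toy cross-domain alignment: match the pairwise similarity structure of
    real and synthetic visual features of the same classes. This is one
    plausible reading of 'semantic structure consistency'; the paper's loss
    may be formulated differently."""
    real = F.normalize(real_feats, dim=-1)
    synth = F.normalize(synth_feats, dim=-1)
    return F.mse_loss(real @ real.t(), synth @ synth.t())


if __name__ == "__main__":
    C, T, D = 10, 4, 512
    prompts = DomainPrompts(ToyTextEncoder(D, D), torch.randn(C, T, D))
    text_real = prompts(domain=0)                 # class embeddings, real-domain prompt
    text_synth = prompts(domain=1)                # class embeddings, synthetic-domain prompt
    img_real, img_synth = torch.randn(8, D), torch.randn(8, D)    # placeholder visual features
    logits = 100.0 * F.normalize(img_real, dim=-1) @ text_real.t()
    loss = structure_alignment_loss(img_real, img_synth)
    print(logits.shape, loss.item())
```

In an actual training loop, the alignment term would be added to the usual classification objective over base-class (real plus synthetic) samples; the weighting between the two terms is not specified here and would need to come from the paper.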
