You Chenyu, Min Yifei, Dai Weicheng, Sekhon Jasjeet S, Staib Lawrence, Duncan James S
Yale University.
Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2024 Jun;2024:26140-26150. doi: 10.1109/cvpr52733.2024.02470. Epub 2024 Sep 16.
Fine-tuning pre-trained vision-language models such as CLIP has yielded success on diverse downstream tasks. However, several pain points persist for this paradigm: (i) directly tuning entire pre-trained models is both time-intensive and computationally costly, and the tuned models tend to become highly specialized, limiting their practicality for real-world deployment; (ii) recent studies indicate that pre-trained vision-language classifiers may overly depend on spurious features, i.e., patterns that correlate with the target in the training data but are unrelated to the true labeling function; and (iii) existing work on mitigating reliance on spurious features, largely built on the assumption that such features can be identified, offers no definitive assurance for real-world applications. As a pilot study, this work focuses on mitigating CLIP's reliance on spurious features without using any group annotation. To this end, we systematically study the existence of spurious correlations in CLIP and CLIP+ERM. Following recent work on Deep Feature Reweighting (DFR), we first verify that last-layer retraining can greatly improve group robustness on pretrained CLIP. In view of these findings, we advocate a lightweight representation calibration method for fine-tuning CLIP: first generate a calibration set using the pretrained CLIP, then calibrate the representations of samples within this set through contrastive learning, all without the need for group labels. Extensive experiments and in-depth visualizations on several benchmarks validate the effectiveness of our proposal, largely reducing reliance on spurious features and significantly boosting model generalization. Our code will be available here.
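The abstract describes the calibration pipeline only at a high level. The following is a minimal, hypothetical Python sketch of what such a pipeline could look like, assuming the calibration set is pseudo-labeled with zero-shot predictions from the frozen pretrained CLIP and then refined with a standard supervised-contrastive objective; the prompts, loss form, and helper names (zero_shot_labels, supervised_contrastive_loss) are illustrative assumptions, not the paper's exact recipe.

    # Hypothetical sketch: build a calibration set from zero-shot CLIP
    # pseudo-labels, then refine representations with a contrastive loss.
    # Details below are assumptions, not the authors' published method.
    import torch
    import torch.nn.functional as F
    import clip  # OpenAI CLIP package

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Example class prompts (illustrative; e.g., a Waterbirds-style task).
    class_prompts = ["a photo of a landbird", "a photo of a waterbird"]
    with torch.no_grad():
        text_feats = F.normalize(
            model.encode_text(clip.tokenize(class_prompts).to(device)), dim=-1)

    @torch.no_grad()
    def zero_shot_labels(images):
        """Pseudo-label a preprocessed image batch with frozen CLIP
        (no group annotations are used anywhere)."""
        img_feats = F.normalize(model.encode_image(images), dim=-1)
        return (img_feats @ text_feats.T).argmax(dim=-1)

    def supervised_contrastive_loss(feats, labels, tau=0.07):
        """Pull together samples sharing a pseudo-label; push apart the rest."""
        feats = F.normalize(feats, dim=-1)
        sim = feats @ feats.T / tau
        n = feats.size(0)
        not_self = ~torch.eye(n, dtype=torch.bool, device=feats.device)
        pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
        # log-softmax over all non-self pairs
        log_prob = sim - torch.logsumexp(
            sim.masked_fill(~not_self, float("-inf")), dim=1, keepdim=True)
        pos_counts = pos_mask.sum(1).clamp(min=1)
        return -(log_prob * pos_mask).sum(1).div(pos_counts).mean()

Consistent with the last-layer-retraining finding above, one would plausibly train only a lightweight head or the final layers on top of the (mostly) frozen encoder with this loss; exactly which parameters are updated is left open by the abstract.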