

Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations.

Authors

You Chenyu, Min Yifei, Dai Weicheng, Sekhon Jasjeet S, Staib Lawrence, Duncan James S

Affiliations

Yale University.

Publication Information

Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2024 Jun;2024:26140-26150. doi: 10.1109/cvpr52733.2024.02470. Epub 2024 Sep 16.

Abstract

Fine-tuning pre-trained vision-language models, such as CLIP, has yielded success on diverse downstream tasks. However, several pain points persist for this paradigm: (i) directly tuning entire pre-trained models is both time-intensive and computationally costly; moreover, the tuned models tend to become highly specialized, limiting their practicality for real-world deployment; (ii) recent studies indicate that pre-trained vision-language classifiers may overly depend on spurious features - patterns that correlate with the target in the training data but are not related to the true labeling function; and (iii) existing studies on mitigating the reliance on spurious features, largely based on the assumption that such features can be identified, do not provide definitive assurance for real-world applications. As a pilot study, this work focuses on mitigating CLIP's reliance on spurious features without using any group annotations. To this end, we systematically study the existence of spurious correlations in CLIP and CLIP+ERM. Following recent work on Deep Feature Reweighting (DFR), we first verify that last-layer retraining can greatly improve group robustness for pretrained CLIP. In view of this, we advocate a lightweight representation calibration method for fine-tuning CLIP: we first generate a calibration set using the pretrained CLIP, and then calibrate the representations of samples within this set through contrastive learning, all without the need for group labels. Extensive experiments and in-depth visualizations on several benchmarks validate the effectiveness of our proposals, largely reducing reliance on spurious features and significantly boosting model generalization. Our code will be available here.
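The abstract only outlines the pipeline (build a calibration set with the frozen, pretrained CLIP, then calibrate the representations of that set via contrastive learning, with no group labels). Below is a minimal PyTorch sketch of such a pipeline, given only as an illustration: the zero-shot misclassification rule used to select the calibration set, the supervised contrastive loss, the small trainable projection head, and all hyper-parameters are assumptions, not the paper's actual recipe, and random tensors stand in for frozen CLIP image and text features.

```python
# Minimal sketch of an annotation-free calibration pipeline.
# Assumptions: the selection rule, loss, head, and hyper-parameters are
# illustrative stand-ins; the paper's exact procedure is not given here.
import torch
import torch.nn.functional as F


def zero_shot_logits(image_feats, text_feats, temperature=0.01):
    """Cosine-similarity logits of a frozen CLIP zero-shot classifier."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    return img @ txt.T / temperature


def build_calibration_set(image_feats, labels, text_feats):
    """Hypothetical selection rule: keep samples the zero-shot classifier
    gets wrong, on the intuition that they are least aligned with the
    (possibly spurious) features the pretrained model relies on."""
    preds = zero_shot_logits(image_feats, text_feats).argmax(dim=-1)
    mask = preds != labels
    return image_feats[mask], labels[mask]


def supervised_contrastive_loss(feats, labels, temperature=0.1):
    """Pull same-class representations together and push classes apart
    (one plausible reading of 'calibrating representations via contrastive
    learning'); note that no group labels are used anywhere."""
    z = F.normalize(feats, dim=-1)
    sim = z @ z.T / temperature
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    # Log-softmax over all other samples, averaged over positives per anchor.
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float('-inf')), dim=1, keepdim=True)
    pos_counts = pos.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos).sum(dim=1) / pos_counts
    return loss[pos.sum(dim=1) > 0].mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-ins for frozen CLIP features: 512-d image embeddings for 256
    # samples and one text embedding per class ("a photo of a <class>").
    image_feats = torch.randn(256, 512)
    text_feats = torch.randn(2, 512)
    labels = torch.randint(0, 2, (256,))

    cal_feats, cal_labels = build_calibration_set(image_feats, labels, text_feats)
    # Only a small projection head is trained on frozen features, in the
    # lightweight spirit of last-layer retraining rather than full fine-tuning.
    head = torch.nn.Linear(512, 128)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    for _ in range(10):
        loss = supervised_contrastive_loss(head(cal_feats), cal_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"contrastive calibration loss: {loss.item():.4f}")
```

Keeping the CLIP backbone frozen and training only a small head mirrors the DFR-style last-layer-retraining observation cited in the abstract; this is a design choice of the sketch, not a claim about the paper's exact architecture.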


Similar Articles

1
Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations.
Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2024 Jun;2024:26140-26150. doi: 10.1109/cvpr52733.2024.02470. Epub 2024 Sep 16.
2
Enhancing Few-Shot CLIP With Semantic-Aware Fine-Tuning.
IEEE Trans Neural Netw Learn Syst. 2024 Aug 26;PP. doi: 10.1109/TNNLS.2024.3443394.
3
CLIP knows image aesthetics.
Front Artif Intell. 2022 Nov 25;5:976235. doi: 10.3389/frai.2022.976235. eCollection 2022.
