

Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations.

Authors

You Chenyu, Min Yifei, Dai Weicheng, Sekhon Jasjeet S, Staib Lawrence, Duncan James S

Affiliations

Yale University.

Publication Information

Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2024 Jun;2024:26140-26150. doi: 10.1109/cvpr52733.2024.02470. Epub 2024 Sep 16.

Abstract

Fine-tuning pre-trained vision-language models, such as CLIP, has yielded success on diverse downstream tasks. However, several pain points persist for this paradigm: (i) directly tuning entire pre-trained models is both time-intensive and computationally costly; moreover, the tuned models tend to become highly specialized, limiting their practicality for real-world deployment; (ii) recent studies indicate that pre-trained vision-language classifiers may overly depend on spurious features - patterns that correlate with the target in the training data but are not related to the true labeling function; and (iii) existing studies on mitigating the reliance on spurious features, largely based on the assumption that such features can be identified, do not provide definitive assurance for real-world applications. As a pilot study, this work focuses on mitigating CLIP's reliance on spurious features without using any group annotations. To this end, we systematically study the existence of spurious correlations in CLIP and CLIP+ERM. Following recent work on Deep Feature Reweighting (DFR), we first verify that last-layer retraining can greatly improve group robustness for pretrained CLIP. In view of this, we advocate a lightweight representation calibration method for fine-tuning CLIP: we first generate a calibration set using the pretrained CLIP, and then calibrate the representations of samples within this set through contrastive learning, all without the need for group labels. Extensive experiments and in-depth visualizations on several benchmarks validate the effectiveness of our proposals, largely reducing reliance on spurious features and significantly boosting model generalization. Our code will be available here.
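The abstract only outlines the pipeline (build a calibration set with the frozen, pretrained CLIP, then calibrate the representations of that set via contrastive learning, with no group labels). Below is a minimal PyTorch sketch of such a pipeline, given only as an illustration: the zero-shot misclassification rule used to select the calibration set, the supervised contrastive loss, the small trainable projection head, and all hyper-parameters are assumptions, not the paper's actual recipe, and random tensors stand in for frozen CLIP image and text features.

```python
# Minimal sketch of an annotation-free calibration pipeline.
# Assumptions: the selection rule, loss, head, and hyper-parameters are
# illustrative stand-ins; the paper's exact procedure is not given here.
import torch
import torch.nn.functional as F


def zero_shot_logits(image_feats, text_feats, temperature=0.01):
    """Cosine-similarity logits of a frozen CLIP zero-shot classifier."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    return img @ txt.T / temperature


def build_calibration_set(image_feats, labels, text_feats):
    """Hypothetical selection rule: keep samples the zero-shot classifier
    gets wrong, on the intuition that they are least aligned with the
    (possibly spurious) features the pretrained model relies on."""
    preds = zero_shot_logits(image_feats, text_feats).argmax(dim=-1)
    mask = preds != labels
    return image_feats[mask], labels[mask]


def supervised_contrastive_loss(feats, labels, temperature=0.1):
    """Pull same-class representations together and push classes apart
    (one plausible reading of 'calibrating representations via contrastive
    learning'); note that no group labels are used anywhere."""
    z = F.normalize(feats, dim=-1)
    sim = z @ z.T / temperature
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    # Log-softmax over all other samples, averaged over positives per anchor.
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float('-inf')), dim=1, keepdim=True)
    pos_counts = pos.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos).sum(dim=1) / pos_counts
    return loss[pos.sum(dim=1) > 0].mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-ins for frozen CLIP features: 512-d image embeddings for 256
    # samples and one text embedding per class ("a photo of a <class>").
    image_feats = torch.randn(256, 512)
    text_feats = torch.randn(2, 512)
    labels = torch.randint(0, 2, (256,))

    cal_feats, cal_labels = build_calibration_set(image_feats, labels, text_feats)
    # Only a small projection head is trained on frozen features, in the
    # lightweight spirit of last-layer retraining rather than full fine-tuning.
    head = torch.nn.Linear(512, 128)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    for _ in range(10):
        loss = supervised_contrastive_loss(head(cal_feats), cal_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"contrastive calibration loss: {loss.item():.4f}")
```

Keeping the CLIP backbone frozen and training only a small head mirrors the DFR-style last-layer-retraining observation cited in the abstract; this is a design choice of the sketch, not a claim about the paper's exact architecture.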


Similar Articles

1
Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations.
Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2024 Jun;2024:26140-26150. doi: 10.1109/cvpr52733.2024.02470. Epub 2024 Sep 16.
2
Enhancing Few-Shot CLIP With Semantic-Aware Fine-Tuning.
IEEE Trans Neural Netw Learn Syst. 2024 Aug 26;PP. doi: 10.1109/TNNLS.2024.3443394.
3
CLIP knows image aesthetics.
Front Artif Intell. 2022 Nov 25;5:976235. doi: 10.3389/frai.2022.976235. eCollection 2022.
