Adapting Vision-Language Models via Learning to Inject Knowledge.

Authors

Xuan Shiyu, Yang Ming, Zhang Shiliang

Publication

IEEE Trans Image Process. 2024;33:5798-5809. doi: 10.1109/TIP.2024.3468884. Epub 2024 Oct 15.

Abstract

Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot performance on various vision tasks. Trained on millions or even billions of image-text pairs, the text encoder has memorized a substantial amount of appearance knowledge. Such knowledge in VLMs is usually leveraged by learning task-specific prompts, which may limit performance on unseen tasks. This paper proposes a new knowledge injection framework to pursue a generalizable adaptation of VLMs to downstream vision tasks. Instead of learning task-specific prompts, we extract task-agnostic knowledge features and insert them into the features of input images or texts. The fused features hence gain better discriminative capability and robustness to intra-category variances. These knowledge features are generated by feeding learnable prompt sentences into the text encoder of the VLM and extracting its multi-layer features. A new knowledge injection module (KIM) is proposed to refine text features or visual features using the knowledge features. This knowledge injection framework enables both modalities to benefit from the rich knowledge memorized in the text encoder. Experiments show that our method outperforms recently proposed methods under few-shot learning, base-to-new classes generalization, cross-dataset transfer, and domain generalization settings. For instance, it outperforms CoOp by 4.5% under the few-shot learning setting, and CoCoOp by 4.4% under the base-to-new classes generalization setting. Our code will be released.
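To make the injection idea concrete, below is a minimal sketch of how knowledge features might be fused into image or text features. The abstract does not specify the internals of KIM, so the cross-attention-plus-residual design, the module name KnowledgeInjectionModule, and all dimensions here are illustrative assumptions, not the paper's implementation; the knowledge features stand in for multi-layer text-encoder outputs of learnable prompt sentences.

# Hedged sketch (PyTorch): fuse knowledge features into image/text features.
# Assumption: injection is modeled as cross-attention from the input features
# (queries) to the knowledge features (keys/values), followed by a residual
# connection. Shapes and hyperparameters are placeholders.

import torch
import torch.nn as nn


class KnowledgeInjectionModule(nn.Module):
    """Refines input features with knowledge features via cross-attention (illustrative)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor, knowledge: torch.Tensor) -> torch.Tensor:
        # feats:     (batch, n_tokens, dim)  image or text features
        # knowledge: (batch, n_know, dim)    features of learnable prompt sentences,
        #                                    e.g. gathered from several text-encoder layers
        injected, _ = self.attn(query=feats, key=knowledge, value=knowledge)
        return self.norm(feats + injected)  # residual fusion of the two sources


if __name__ == "__main__":
    batch, n_tokens, n_know, dim = 4, 197, 16, 512
    image_feats = torch.randn(batch, n_tokens, dim)    # stand-in for visual tokens
    knowledge_feats = torch.randn(batch, n_know, dim)  # stand-in for knowledge features
    kim = KnowledgeInjectionModule(dim)
    fused = kim(image_feats, knowledge_feats)
    print(fused.shape)  # torch.Size([4, 197, 512])

Because the same module takes either visual or textual tokens as queries, both modalities can draw on the knowledge memorized in the text encoder, which is the framework's stated goal.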

