MDAPT: Multi-Modal Depth Adversarial Prompt Tuning to Enhance the Adversarial Robustness of Visual Language Models.

Authors

Li Chao, Liao Yonghao, Ding Caichang, Ye Zhiwei

Affiliations

School of Computer Science, Hubei University of Technology, Wuhan 430068, China.

School of Computer and Information Science, Hubei Engineering University, Xiaogan 432000, China.

Publication

Sensors (Basel). 2025 Jan 5;25(1):258. doi: 10.3390/s25010258.

Abstract

Large visual language models such as Contrastive Language-Image Pre-training (CLIP), despite their excellent performance, are highly vulnerable to adversarial examples. This work investigates the accuracy and robustness of visual language models (VLMs) from a novel multi-modal perspective. We propose a multi-modal fine-tuning method, Multi-modal Depth Adversarial Prompt Tuning (MDAPT), which guides the generation of visual prompts through text prompts to improve the accuracy and adversarial robustness of visual language models. Extensive experiments on three datasets (ϵ=4/255) show significant gains: compared with traditional manually designed prompts, accuracy and robustness increase by an average of 17.84% and 10.85%, respectively. The improvements also hold across different attack methods: under our efficient settings, average accuracy and robustness improve by 32.16% and 21.00%, respectively, over traditional manual prompts under three different attacks.
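
The abstract gives only a high-level description, so the following is a minimal, self-contained PyTorch sketch of the general recipe it implies: frozen image/text encoders, learnable text prompts that generate a visual prompt through a small coupling module, and prompt tuning on PGD adversarial examples crafted within an L∞ ball of radius ϵ=4/255. The toy encoders, the `text_to_visual` module, the prompt shapes, and all hyperparameters are illustrative assumptions, not the authors' MDAPT implementation.

```python
# Minimal sketch, NOT the authors' MDAPT code: encoders, coupling module, and
# hyperparameters below are illustrative assumptions based only on the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenEncoders(nn.Module):
    """Stand-in for a frozen VLM backbone (e.g., CLIP); real code would wrap pretrained encoders."""
    def __init__(self, dim=512):
        super().__init__()
        self.image_proj = nn.Linear(3 * 32 * 32, dim)   # toy image encoder
        self.text_proj = nn.Linear(dim, dim)            # toy text encoder over prompt vectors
        for p in self.parameters():
            p.requires_grad_(False)                      # the backbone stays frozen

    def encode_image(self, images, visual_prompt):
        # The visual prompt is injected additively in feature space (an assumed design choice).
        feats = self.image_proj(images.flatten(1)) + visual_prompt
        return F.normalize(feats, dim=-1)

    def encode_text(self, text_prompt):
        return F.normalize(self.text_proj(text_prompt), dim=-1)

class MultiModalPrompts(nn.Module):
    """Learnable per-class text prompts; the visual prompt is generated from them,
    mirroring the abstract's 'text prompts guide the visual prompts' idea (coupling is assumed)."""
    def __init__(self, num_classes, dim=512):
        super().__init__()
        self.text_prompt = nn.Parameter(0.02 * torch.randn(num_classes, dim))
        self.text_to_visual = nn.Linear(dim, dim)        # hypothetical coupling module

    def forward(self):
        visual_prompt = self.text_to_visual(self.text_prompt.mean(dim=0, keepdim=True))
        return self.text_prompt, visual_prompt

def pgd_attack(encoders, prompts, images, labels, eps=4 / 255, alpha=1 / 255, steps=3):
    """Standard L_inf PGD with radius eps (the abstract's eps = 4/255)."""
    with torch.no_grad():                                # prompts are constants for the attacker
        text_prompt, visual_prompt = prompts()
        text_feats = encoders.encode_text(text_prompt)
    adv = (images + torch.empty_like(images).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        logits = 100.0 * encoders.encode_image(adv, visual_prompt) @ text_feats.t()
        grad = torch.autograd.grad(F.cross_entropy(logits, labels), adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = (images + (adv - images).clamp(-eps, eps)).clamp(0, 1).detach()
    return adv

# One training step: only the prompt parameters are tuned, on adversarial examples.
encoders, prompts = FrozenEncoders(), MultiModalPrompts(num_classes=10)
optimizer = torch.optim.AdamW(prompts.parameters(), lr=1e-3)

images = torch.rand(8, 3, 32, 32)                        # dummy batch in place of a real dataset
labels = torch.randint(0, 10, (8,))

adv_images = pgd_attack(encoders, prompts, images, labels)
text_prompt, visual_prompt = prompts()
logits = 100.0 * encoders.encode_image(adv_images, visual_prompt) @ encoders.encode_text(text_prompt).t()
loss = F.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Swapping the toy encoders for a pretrained CLIP backbone and the dummy batch for a real data loader would turn this skeleton into a usable adversarial prompt-tuning loop; the key property it illustrates is that only the prompt parameters receive gradients while the backbone stays fixed.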
