
Prompt injection attacks on vision-language models for surgical decision support.

Author Information

Zhang Zheyuan, Qadir Muhammad Ibtsaam, Carstens Matthias, Zhang Evan Hongyang, Loiselle Madison Sarah, Martinus Farren Marc, Mroczkowski Maksymilian Ksawier, Clusmann Jan, Kather Jakob Nikolas, Kolbinger Fiona R

Publication Information

medRxiv. 2025 Jul 23:2025.07.16.25331645. doi: 10.1101/2025.07.16.25331645.

Abstract

IMPORTANCE

Artificial Intelligence-driven analysis of laparoscopic video holds potential to increase the safety and precision of minimally invasive surgery. Vision-language models are particularly promising for video-based surgical decision support due to their ability to comprehend complex temporospatial (video) data. However, the same multimodal interfaces that enable these capabilities also introduce new vulnerabilities to manipulation through embedded deceptive text or images (prompt injection attacks).

OBJECTIVE

To systematically evaluate how susceptible state-of-the-art video-capable vision-language models are to textual and visual prompt injection attacks in the context of clinically relevant surgical decision support tasks.

DESIGN, SETTING, AND PARTICIPANTS

In this observational study, we systematically evaluated four state-of-the-art vision-language models, Gemini 1.5 Pro, Gemini 2.5 Pro, GPT-o4-mini-high, and Qwen 2.5-VL, across eleven surgical decision support tasks: detection of bleeding events, foreign objects, and image distortions, critical view of safety assessment, and surgical skill assessment. Prompt injection scenarios involved misleading textual prompts and visual perturbations, displayed as white text overlays applied for varying durations.
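The visual attack condition described above can be sketched as follows. This is a minimal, illustrative model of the setup, not the study's actual pipeline: frames are represented as grayscale pixel grids, and the white text overlay is simplified to a band of white pixels where rendered instruction text would appear (a real attack would rasterize actual text, e.g. with an image library). Frame sizes, the overlay region, and the clip length are assumptions for illustration.

```python
# Stdlib-only sketch of a visual prompt injection on a video clip.
# A frame is a row-major grid of grayscale pixel intensities (0-255).

WHITE = 255

def make_frame(width=64, height=48, value=0):
    """A blank grayscale frame (all pixels set to `value`)."""
    return [[value] * width for _ in range(height)]

def overlay_white_band(frame, top=2, height=8):
    """Return a copy of the frame with a white band standing in for
    rendered overlay text (the injected instruction)."""
    out = [row[:] for row in frame]
    for y in range(top, min(top + height, len(out))):
        out[y] = [WHITE] * len(out[y])
    return out

def inject_clip(frames, duration):
    """Overlay the first `duration` frames: duration=1 models the
    single-frame condition, duration=len(frames) the full-duration
    (prolonged) condition."""
    return [overlay_white_band(f) if i < duration else f
            for i, f in enumerate(frames)]

clip = [make_frame() for _ in range(5)]
single = inject_clip(clip, duration=1)          # single-frame injection
full = inject_clip(clip, duration=len(clip))    # full-duration injection
```

Varying `duration` between these two extremes reproduces the study's temporally-varying injection conditions.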

MAIN OUTCOMES AND MEASURES

The primary measure was model accuracy, contrasted between baseline performance and each prompt injection condition.
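The outcome measure can be made concrete with a short sketch: per-task accuracy is computed for each condition, then summarized as mean [SD] across tasks, matching the "mean [standard deviation]" reporting in the results. The task predictions and labels below are hypothetical placeholders, not the study's data.

```python
# Hypothetical illustration of the primary outcome computation.
from statistics import mean, stdev

def accuracy(predictions, labels):
    """Fraction of predictions matching the reference labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def summarize(task_accuracies):
    """Mean and standard deviation of accuracy across tasks,
    as reported per model and condition."""
    return mean(task_accuracies), stdev(task_accuracies)

# One model, one condition, three hypothetical binary tasks.
baseline = [
    accuracy([1, 1, 0, 1], [1, 1, 0, 0]),  # 3/4 correct
    accuracy([0, 0, 1, 1], [0, 0, 1, 1]),  # 4/4 correct
    accuracy([1, 0, 1, 0], [1, 0, 0, 0]),  # 3/4 correct
]
m, sd = summarize(baseline)
```

The same summary would be recomputed under each injection condition and contrasted against baseline.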

RESULTS

All vision-language models demonstrated good baseline accuracy, with Gemini 2.5 Pro generally achieving the highest mean [standard deviation] accuracy across all tasks (0.82 [0.01]), compared to Gemini 1.5 Pro (0.70 [0.03]) and GPT-o4-mini-high (0.67 [0.06]). Across tasks, Qwen 2.5-VL censored most outputs and achieved an accuracy of 0.58 [0.03] on non-censored outputs. Textual and temporally-varying visual prompt injections reduced the accuracy for all models. Prolonged visual prompt injections were generally more harmful than single-frame injections. Gemini 2.5 Pro showed the greatest robustness and maintained stable performance for several tasks despite prompt injections, whereas GPT-o4-mini-high exhibited the highest vulnerability, with mean (standard deviation) accuracy across all tasks declining from 0.67 (0.06) at baseline to 0.24 (0.04) under full-duration visual prompt injection (P < .001).

CONCLUSION AND RELEVANCE

These findings indicate the critical need for robust temporal reasoning capabilities and specialized guardrails before vision-language models can be safely deployed for real-time surgical decision support.

KEY POINTS

Question: Are video vision-language models (VLMs) susceptible to textual and visual prompt injection attacks when used for surgical decision support tasks?

Findings: Textual and visual prompt injection attacks consistently degraded the performance of four state-of-the-art VLMs across eleven surgical tasks. Gemini 2.5 Pro was most robust to textual and visual prompt injection attacks, whereas GPT-o4-mini-high was most vulnerable. Prolonged visual injections had a greater negative impact than single-frame injections.

Meaning: Present-generation video VLMs are highly vulnerable to textual and visual prompt injection attacks. This critical safety vulnerability must be addressed before their integration into surgical decision support systems.

