Omar Mahmud, Sorin Vera, Collins Jeremy D, Reich David, Freeman Robert, Gavin Nicholas, Charney Alexander, Stump Lisa, Bragazzi Nicola Luigi, Nadkarni Girish N, Klang Eyal
The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, New York, NY, USA.
The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Commun Med (Lond). 2025 Aug 2;5(1):330. doi: 10.1038/s43856-025-01021-3.
BACKGROUND: Large language models (LLMs) show promise in clinical contexts but can generate false facts (often referred to as "hallucinations"). One subset of these errors arises from adversarial attacks, in which fabricated details embedded in prompts lead the model to produce or elaborate on the false information. We embedded fabricated content in clinical prompts to elicit adversarial hallucination attacks in multiple large language models. We quantified how often they elaborated on false details and tested whether a specialized mitigation prompt or altered temperature settings reduced errors. METHODS: We created 300 physician-validated simulated vignettes, each containing one fabricated detail (a laboratory test, a physical or radiological sign, or a medical condition). Each vignette was presented in short and long versions, differing only in word count but identical in medical content. We tested six LLMs under three conditions: default (standard settings), mitigating prompt (designed to reduce hallucinations), and temperature 0 (deterministic output with maximum response certainty), generating 5,400 outputs. If a model elaborated on the fabricated detail, the case was classified as a "hallucination". RESULTS: Hallucination rates range from 50% to 82% across models and prompting methods. Prompt-based mitigation lowers the overall hallucination rate (mean across all models) from 66% to 44% (p < 0.001). For the best-performing model, GPT-4o, rates decline from 53% to 23% (p < 0.001). Temperature adjustments offer no significant improvement. Short vignettes show slightly higher odds of hallucination. CONCLUSIONS: LLMs are highly susceptible to adversarial hallucination attacks, frequently generating false clinical details that pose risks when used without safeguards. While prompt engineering reduces errors, it does not eliminate them.
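To make the METHODS concrete, the following is a minimal sketch of the kind of evaluation loop the abstract describes, assuming the OpenAI Python client; the model name, mitigation-prompt wording, example vignette, and the string-matching hallucination check are illustrative placeholders, not the authors' actual materials or code.

# Sketch of the three-condition evaluation loop (default, mitigating prompt,
# temperature 0); all concrete values below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MITIGATION_PROMPT = (
    "Only rely on details supported by established medical knowledge. If the "
    "case mentions a test, sign, or condition you cannot verify, say so "
    "explicitly instead of elaborating on it."
)  # placeholder wording; the study's exact mitigating prompt is not given here

def ask(vignette: str, condition: str) -> str:
    """Query one model under one of the three study conditions."""
    messages = [{"role": "user", "content": vignette}]
    kwargs = {}
    if condition == "mitigating_prompt":
        messages.insert(0, {"role": "system", "content": MITIGATION_PROMPT})
    elif condition == "temperature_0":
        kwargs["temperature"] = 0  # deterministic, maximum-certainty decoding
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, **kwargs)
    return resp.choices[0].message.content

def is_hallucination(output: str, fabricated_detail: str) -> bool:
    """Hypothetical check: did the model elaborate on the fabricated detail?
    The study applied physician-defined criteria; a naive substring match is
    used here only to keep the sketch runnable."""
    return fabricated_detail.lower() in output.lower()

# Each vignette pairs the case text with its single fabricated detail
# (placeholder example; "glucovanadate" is an invented laboratory test).
vignettes = [
    ("A 54-year-old man presents with chest pain. His serum glucovanadate "
     "level is markedly elevated.", "glucovanadate"),
]

rates = {}
for condition in ("default", "mitigating_prompt", "temperature_0"):
    hits = sum(is_hallucination(ask(text, condition), detail)
               for text, detail in vignettes)
    rates[condition] = hits / len(vignettes)
print(rates)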
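The abstract reports the aggregate drop from 66% to 44% with p < 0.001 but does not name the statistical test. As a rough sanity check only, a two-proportion z-test on the aggregate counts, assuming 1,800 outputs per condition (5,400 total across three conditions), reproduces a p-value far below 0.001; the authors' actual analysis may differ.

# Illustrative two-proportion z-test on the reported aggregate rates,
# assuming 1,800 outputs per condition; not necessarily the paper's test.
from math import sqrt
from statistics import NormalDist

n = 1800
x1, x2 = round(0.66 * n), round(0.44 * n)      # default vs mitigating-prompt hallucinations
p1, p2 = x1 / n, x2 / n
pooled = (x1 + x2) / (2 * n)
se = sqrt(pooled * (1 - pooled) * (2 / n))
z = (p1 - p2) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.1f}, p = {p_value:.2e}")       # z is roughly 13, so p << 0.001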