Suppr超能文献

减轻大语言模型中的对抗性操纵:一种基于提示的方法来对抗越狱攻击(Prompt-G)。

Mitigating adversarial manipulation in LLMs: a prompt-based approach to counter Jailbreak attacks (Prompt-G).

作者信息

Pingua Bhagyajit, Murmu Deepak, Kandpal Meenakshi, Rautaray Jyotirmayee, Mishra Pranati, Barik Rabindra Kumar, Saikia Manob Jyoti

机构信息

School of Computer Sciences, Odisha University of Technology and Research, Bhubaneswar, Odisha, India.

School of Computer Applications, KIIT Deemed to be University, Bhubaneswar, Odisha, India.

出版信息

PeerJ Comput Sci. 2024 Oct 22;10:e2374. doi: 10.7717/peerj-cs.2374. eCollection 2024.

Abstract

Large language models (LLMs) have become transformative tools in areas like text generation, natural language processing, and conversational AI. However, their widespread use introduces security risks, such as jailbreak attacks, which exploit LLM's vulnerabilities to manipulate outputs or extract sensitive information. Malicious actors can use LLMs to spread misinformation, manipulate public opinion, and promote harmful ideologies, raising ethical concerns. Balancing safety and accuracy require carefully weighing potential risks against benefits. Prompt Guarding (Prompt-G) addresses these challenges by using vector databases and embedding techniques to assess the credibility of generated text, enabling real-time detection and filtering of malicious content. We collected and analyzed a dataset of Self Reminder attacks to identify and mitigate jailbreak attacks, ensuring that the LLM generates safe and accurate responses. In various attack scenarios, Prompt-G significantly reduced jailbreak success rates and effectively identified prompts that caused confusion or distraction in the LLM. Integrating our model with Llama 2 13B chat reduced the attack success rate (ASR) to 2.08%. The source code is available at: https://doi.org/10.5281/zenodo.13501821.

摘要

大语言模型(LLMs)已成为文本生成、自然语言处理和对话式人工智能等领域的变革性工具。然而,它们的广泛使用带来了安全风险,例如越狱攻击,这种攻击利用大语言模型的漏洞来操纵输出或提取敏感信息。恶意行为者可以利用大语言模型传播错误信息、操纵公众舆论并宣扬有害思想,引发了伦理问题。平衡安全性和准确性需要仔细权衡潜在风险与收益。提示防护(Prompt-G)通过使用向量数据库和嵌入技术来评估生成文本的可信度,从而应对这些挑战,实现对恶意内容的实时检测和过滤。我们收集并分析了一个自我提醒攻击的数据集,以识别和缓解越狱攻击,确保大语言模型生成安全准确的回复。在各种攻击场景中,Prompt-G显著降低了越狱成功率,并有效识别出导致大语言模型产生混淆或干扰的提示。将我们的模型与Llama 2 13B聊天模型集成后,攻击成功率(ASR)降至2.08%。源代码可在以下网址获取:https://doi.org/10.5281/zenodo.13501821

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7578/11622839/531ae54f2ba9/peerj-cs-10-2374-g001.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验