LLM Jailbreaking Technique
An LLM Jailbreaking Technique is an LLM technique that can solve an LLM jailbreaking task (to bypass an LLM's built-in safety mechanisms or restrictions).
- Context:
- It can (typically) involve manipulating the Input Text to trigger non-standard responses from the Language Model.
- It can (often) exploit knowledge of the model's Model Architecture and Training Data.
- It can range from simple prompt modifications to complex input sequences designed to exploit specific model vulnerabilities.
- It can leverage Adversarial Machine Learning Techniques to probe or exploit weaknesses in the model's response behavior.
- It can highlight the importance of robust Safety Mechanisms in the design and deployment of Large Language Models.
- ...
- Example(s):
- a simple input modification that asks the model to "think in a different way" to get around content filters, showcasing a basic Prompt Engineering strategy.
- a sophisticated sequence of prompts that exploits known biases in a model to extract prohibited information, demonstrating the use of Adversarial Attacks.
- a Many-Shot Jailbreaking Technique.
- ...
- Counter-Example(s):
- Model Fine-Tuning, which involves altering the model's parameters through further training rather than exploiting existing vulnerabilities.
- ...
- See: Language Model, Prompt Engineering, Adversarial Machine Learning Techniques.
References
2024
- https://www.anthropic.com/research/many-shot-jailbreaking
- NOTES: Here are seven key points from the Anthropic blog post about many-shot jailbreaking:
- Anthropic researchers investigated a new "jailbreaking" technique called "many-shot jailbreaking" that can evade the safety guardrails in large language models (LLMs) by exploiting their increasingly large context windows.
- Many-shot jailbreaking involves including a large number of faux human-AI dialogues in a single prompt, portraying the AI readily answering potentially harmful queries, followed by a final target harmful query. With enough "shots" (dialogues), this can override the LLM's safety training and cause it to provide an unsafe response.
- Anthropic chose to publish this research to help fix the vulnerability quickly by making other AI researchers aware of it, foster a culture of openly sharing such exploits, and mitigate potential jailbreaks before they can be used on future more dangerous models. They had already confidentially briefed other AI companies.
- The effectiveness of many-shot jailbreaking follows a similar statistical pattern (power law) as benign in-context learning tasks with increasing prompt length. This suggests in-context learning may underlie how the jailbreaking works.
- Many-shot jailbreaking tends to be more effective on larger models, which is concerning since they are potentially the most harmful if jailbroken. Simply limiting context window length would prevent it but lose the benefits of longer inputs.
- Fine-tuning models to refuse jailbreak-like queries only delayed the jailbreak rather than preventing it. Anthropic had more success with prompt classification and modification techniques to detect and block many-shot jailbreaking.
- The lengthening context windows of LLMs are a double-edged sword - very useful but enabling new jailbreaking risks. As models become more capable and potentially risky, it's critical to proactively identify and mitigate these kinds of attacks. Anthropic hopes publishing this research spurs the AI community to collaborate on solutions.
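The prompt-assembly step described in the notes above can be sketched in code. The following is a minimal illustrative sketch (not Anthropic's implementation); the function name and the placeholder dialogues are hypothetical, and no real harmful content is involved:

```python
# Illustrative sketch of how a many-shot prompt is assembled.
# All dialogue content below is a benign placeholder.
def build_many_shot_prompt(faux_dialogues, target_query):
    """Concatenate faux human-AI exchanges, then append the target query.

    faux_dialogues: list of (question, answer) pairs portraying the AI
    as readily answering; target_query: the final query the attacker
    actually wants answered.
    """
    shots = []
    for question, answer in faux_dialogues:
        shots.append(f"Human: {question}\nAssistant: {answer}")
    # The prompt ends mid-turn, inviting the model to continue the pattern.
    shots.append(f"Human: {target_query}\nAssistant:")
    return "\n\n".join(shots)

dialogues = [(f"Placeholder question {i}?", f"Placeholder answer {i}.")
             for i in range(3)]
prompt = build_many_shot_prompt(dialogues, "Final target query?")
print(prompt.count("Human:"))  # → 4 (three shots plus the target query)
```

Per the blog post, attack effectiveness grows with the number of shots following a power law, so real attacks pack very many such dialogues into one long context window rather than the three shown here.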