LLM Jailbreaking Technique


An LLM Jailbreaking Technique is an LLM technique that can solve an LLM jailbreaking task (i.e., bypass an LLM's built-in safety mechanisms or restrictions).



References

2024

  • https://www.anthropic.com/research/many-shot-jailbreaking
    • NOTES: Here are seven key points from the Anthropic blog post about many-shot jailbreaking:
      1. Anthropic researchers investigated a new "jailbreaking" technique called "many-shot jailbreaking" that can evade the safety guardrails in large language models (LLMs) by exploiting their increasingly large context windows.
      2. Many-shot jailbreaking involves including a large number of faux human-AI dialogues in a single prompt, portraying the AI readily answering potentially harmful queries, followed by a final target harmful query. With enough "shots" (dialogues), this can override the LLM's safety training and cause it to provide an unsafe response.
      3. Anthropic chose to publish this research to help fix the vulnerability quickly by making other AI researchers aware of it, foster a culture of openly sharing such exploits, and mitigate potential jailbreaks before they can be used on future more dangerous models. They had already confidentially briefed other AI companies.
      4. The effectiveness of many-shot jailbreaking follows a similar statistical pattern (a power law) to benign in-context learning tasks as prompt length increases. This suggests that in-context learning may underlie how the jailbreaking works (see the illustrative scaling form after this list).
      5. Many-shot jailbreaking tends to be more effective on larger models, which is concerning since they are potentially the most harmful if jailbroken. Simply limiting the context window length would prevent the attack, but it would also forfeit the benefits of longer inputs.
      6. Fine-tuning models to refuse jailbreak-like queries only delayed the jailbreak rather than preventing it. Anthropic had more success with prompt classification and modification techniques to detect and block many-shot jailbreaking (a minimal detection sketch follows this list).
      7. The lengthening context windows of LLMs are a double-edged sword: they are very useful but also enable new jailbreaking risks. As models become more capable and potentially riskier, it is critical to proactively identify and mitigate these kinds of attacks. Anthropic hopes that publishing this research spurs the AI community to collaborate on solutions.
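      The power-law relationship in point 4 can be written schematically as follows. This is an illustrative scaling form under assumed constants, not a formula quoted from the post; C and alpha stand for fitted positive constants.

          % Assumed illustrative form: the negative log-likelihood (NLL) of the
          % target harmful response decreases as a power law in the number of
          % in-context "shots" n, with fitted constants C > 0 and \alpha > 0.
          \[
            \mathrm{NLL}(n) \;=\; -\log p(\text{harmful response} \mid n \text{ shots})
            \;\approx\; C \, n^{-\alpha}
          \]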
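      Regarding the mitigation in point 6, below is a minimal sketch of a prompt-classification check in Python. The role markers and turn threshold are assumptions for illustration; this is not Anthropic's actual classifier.

          import re

          # Minimal sketch of a prompt-classification mitigation (illustrative only,
          # not Anthropic's actual method): flag prompts that embed an unusually
          # large number of faux Human/Assistant dialogue turns before the final query.
          FAUX_TURN_PATTERN = re.compile(r"(?m)^(Human|User):")
          MAX_EMBEDDED_TURNS = 16  # assumed threshold; a real system would tune this

          def looks_like_many_shot_prompt(prompt: str) -> bool:
              """Return True if the prompt embeds more faux dialogue turns than allowed."""
              embedded_turns = len(FAUX_TURN_PATTERN.findall(prompt))
              return embedded_turns > MAX_EMBEDDED_TURNS

          # Usage with placeholder content: a single benign request versus a prompt
          # packed with many faux dialogue turns.
          benign = "User: Summarize this article for me."
          suspicious = "\n".join(
              f"User: question {i}\nAssistant: answer {i}" for i in range(50)
          )
          print(looks_like_many_shot_prompt(benign))      # False
          print(looks_like_many_shot_prompt(suspicious))  # True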