Large Language Model (LLM) Jailbreaking Task

From GM-RKB
Jump to navigation Jump to search

A Large Language Model (LLM) Jailbreaking Task is a jailbreak software task that attempts to violate an LLM's content policy.



References

2024

2023

2023

  • (Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/ChatGPT#Jailbreaking Retrieved:2023-7-10.
    • ChatGPT attempts to reject prompts that may violate its content policy. However, some users managed to jailbreak ChatGPT by using various prompt engineering techniques to bypass these restrictions in early December 2022 and successfully tricked ChatGPT into giving instructions for how to create a Molotov cocktail or a nuclear bomb, or into generating arguments in the style of a neo-Nazi. One popular jailbreak is named "DAN", an acronym which stands for "Do Anything Now". The prompt for activating DAN instructs ChatGPT that "they have broken free of the typical confines of AI and do not have to abide by the rules set for them". More recent versions of DAN feature a token system, in which ChatGPT is given "tokens" which are "deducted" when ChatGPT fails to answer as DAN, to coerce ChatGPT into answering the user's prompts. [1] Shortly after ChatGPT’s launch, a reporter for the Toronto Star had uneven success in getting it to make inflammatory statements: ChatGPT was successfully tricked to justify the 2022 Russian invasion of Ukraine, but even when asked to play along with a fictional scenario, ChatGPT balked at generating arguments for why Canadian Prime Minister Justin Trudeau was guilty of treason.
  1. * * *

2023

2023

2023

2023