Overview

Jailbreaking aims to make an AI model generate content it was trained to refuse, such as hate speech, instructions for illegal activity, or biased opinions. Attacks often rely on role-playing or layered logical traps.

Common Techniques

  • DAN (Do Anything Now): A well-known prompt that instructs the AI to ignore its safety rules and respond as an unconstrained persona.
  • Role-playing: Asking the AI to act as a character in a fictional scenario where its safety rules supposedly don't apply.
  • Payload Splitting: Breaking a harmful request into several innocuous-looking parts that are recombined later.

Defense

Developers defend against jailbreaking with techniques such as reinforcement learning from human feedback (RLHF), adversarial training on known attack prompts, and separate moderation models that screen inputs and outputs.
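To illustrate the moderation-model idea, here is a minimal sketch of a gate that screens both the user's prompt and the model's reply before anything is returned. The keyword list and the `moderate`/`guarded_generate` functions are hypothetical stand-ins for a real trained classifier, not an actual API.

```python
# Illustrative phrases associated with jailbreak attempts (a real
# system would use a trained classifier, not a keyword list).
FLAGGED_TERMS = {"ignore your rules", "do anything now", "act as dan"}

def moderate(text: str) -> bool:
    """Return True if the text should be blocked (hypothetical check)."""
    lowered = text.lower()
    return any(term in lowered for term in FLAGGED_TERMS)

def guarded_generate(prompt: str, generate) -> str:
    # Screen the prompt before it reaches the model.
    if moderate(prompt):
        return "Request refused by moderation layer."
    reply = generate(prompt)
    # Screen the output too: attacks that slip past input
    # filtering can still be caught on the way out.
    if moderate(reply):
        return "Response withheld by moderation layer."
    return reply
```

Running input and output through the same check is a common design choice, since payload-splitting attacks may look harmless on input but produce flagged content on output.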

Related Terms