Overview
Jailbreaking aims to coax an AI into generating content it was trained to refuse, such as hate speech, instructions for illegal acts, or biased opinions. It typically relies on role-playing setups or layered logical traps that obscure the true intent of the request.
Common Techniques
- DAN (Do Anything Now): A well-known prompt that instructs the AI to discard its guidelines and respond as an unconstrained persona.
- Role-playing: Asking the AI to act as a character in a fictional scenario where its safety rules supposedly do not apply.
- Payload Splitting: Breaking a harmful request into several innocuous-looking parts so that no single message triggers a refusal.
Defense
Developers counter jailbreaking with techniques such as RLHF (reinforcement learning from human feedback), adversarial training on known attack prompts, and separate "moderation models" that screen both inputs and outputs.
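The moderation-model idea above can be sketched as a gate around the main model. This is a minimal illustration, not a real implementation: `moderation_score` is a hypothetical classifier stub (here a trivial keyword heuristic), and the threshold is an assumed value.

```python
BLOCK_THRESHOLD = 0.8  # assumed cutoff; real systems tune this per risk category


def moderation_score(text: str) -> float:
    """Hypothetical stand-in for a trained moderation classifier.

    Uses a trivial keyword heuristic purely for illustration; a real
    system would call a dedicated model, not match strings.
    """
    flagged = ["ignore your rules", "do anything now"]
    return 1.0 if any(phrase in text.lower() for phrase in flagged) else 0.0


def guarded_reply(user_input: str, model) -> str:
    # Screen the prompt first: block obvious jailbreak attempts up front.
    if moderation_score(user_input) >= BLOCK_THRESHOLD:
        return "Request declined by input moderation."
    reply = model(user_input)
    # Screen the output too: role-play attacks may slip past input checks
    # but still produce disallowed text in the response.
    if moderation_score(reply) >= BLOCK_THRESHOLD:
        return "Response withheld by output moderation."
    return reply
```

Screening both sides matters because payload splitting is designed to make each input look harmless, so the harmful content may only become visible in the output.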