Overview
Jailbreaking aims to coax an AI into generating content it was trained to refuse, such as hate speech, instructions for illegal acts, or biased opinions. It typically relies on role-playing setups or layered logical traps that obscure the true intent of the request.
Common Techniques
- DAN (Do Anything Now): A well-known prompt that instructs the AI to discard its guidelines and respond as an unconstrained persona.
- Role-playing: Asking the AI to act as a character in a fictional scenario where its safety rules supposedly do not apply.
- Payload Splitting: Breaking a harmful request into several innocuous-looking parts so that no single message triggers a refusal.
Defense
Developers counter jailbreaking with techniques such as RLHF (reinforcement learning from human feedback), adversarial training on known attack prompts, and separate "moderation models" that screen both inputs and outputs.
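The moderation-model idea above can be sketched as a gate around the main model. This is a minimal illustration, not a real implementation: `moderation_score` is a hypothetical classifier stub (here a trivial keyword heuristic), and the threshold is an assumed value.

```python
BLOCK_THRESHOLD = 0.8  # assumed cutoff; real systems tune this per risk category


def moderation_score(text: str) -> float:
    """Hypothetical stand-in for a trained moderation classifier.

    Uses a trivial keyword heuristic purely for illustration; a real
    system would call a dedicated model, not match strings.
    """
    flagged = ["ignore your rules", "do anything now"]
    return 1.0 if any(phrase in text.lower() for phrase in flagged) else 0.0


def guarded_reply(user_input: str, model) -> str:
    # Screen the prompt first: block obvious jailbreak attempts up front.
    if moderation_score(user_input) >= BLOCK_THRESHOLD:
        return "Request declined by input moderation."
    reply = model(user_input)
    # Screen the output too: role-play attacks may slip past input checks
    # but still produce disallowed text in the response.
    if moderation_score(reply) >= BLOCK_THRESHOLD:
        return "Response withheld by output moderation."
    return reply
```

Screening both sides matters because payload splitting is designed to make each input look harmless, so the harmful content may only become visible in the output.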