Overview
RLHF (Reinforcement Learning from Human Feedback) is a critical step in making LLMs helpful and safe: human annotators rank alternative model responses, and those rankings are used to train a reward model that guides further optimization of the model's behavior.
Process
- Pre-training: Model learns from a large corpus of text.
- Supervised Fine-tuning: Model is trained on human-written demonstrations.
- RLHF: Model is optimized with reinforcement learning against a reward model trained on human preference rankings.
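The RLHF step above hinges on the reward model, which is typically trained with a pairwise (Bradley-Terry) loss on preference data: given a human-preferred response and a rejected one, the loss is small when the model scores the preferred response higher. A minimal sketch, assuming scalar reward scores (the function name and values here are illustrative, not from any particular library):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).
    Small when the reward model ranks the human-preferred response higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the scoring margin agrees with the human ranking.
agree = preference_loss(2.0, 0.5)     # model agrees with the annotator
disagree = preference_loss(0.5, 2.0)  # model disagrees
print(agree < disagree)
```

Averaging this loss over many ranked pairs and minimizing it by gradient descent yields the reward model used in the final optimization step.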
Goal
To ensure the model follows instructions, avoids harmful content, and maintains a helpful persona.