Overview

RLHF (Reinforcement Learning from Human Feedback) is a critical step in making LLMs helpful and safe. Human annotators rank alternative model responses to the same prompt; those rankings are used to train a reward model, which then guides further optimization of the model's behavior.

Process

  1. Pre-training: Model learns from a large corpus of text.
  2. Supervised Fine-tuning: Model is trained on human-written demonstrations.
  3. RLHF: Model is optimized based on human preference rankings.
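In step 3, the human preference rankings are commonly converted into a reward model with a pairwise (Bradley-Terry) objective: the loss penalizes the model whenever the rejected response scores higher than the chosen one. A minimal sketch of that loss (the function name `preference_loss` is illustrative, not from a specific library):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).

    Pushes the reward model to score the human-preferred response
    higher than the rejected one.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the preferred response's reward pulls ahead,
# and grows when the ranking is violated.
print(round(preference_loss(2.0, 0.0), 4))  # → 0.1269 (ranking respected)
print(round(preference_loss(0.0, 2.0), 4))  # → 2.1269 (ranking violated)
```

In practice the rewards come from a neural scoring head over full responses, but the same scalar loss is averaged over many comparison pairs.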

Goal

To ensure the AI follows instructions, avoids harmful content, and maintains a helpful persona.

Related Terms