Overview

RLHF (Reinforcement Learning from Human Feedback) is a critical step in making LLMs helpful and safe. Human annotators rank alternative model responses to the same prompt; those rankings are used to train a reward model, which then guides further optimization of the model's behavior.

Process

  1. Pre-training: Model learns from a large corpus of text.
  2. Supervised Fine-tuning: Model is trained on human-written demonstrations.
  3. RLHF: Model is optimized based on human preference rankings.
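In step 3, the human preference rankings are commonly converted into a reward model with a pairwise (Bradley-Terry) objective: the loss penalizes the model whenever the rejected response scores higher than the chosen one. A minimal sketch of that loss (the function name `preference_loss` is illustrative, not from a specific library):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).

    Pushes the reward model to score the human-preferred response
    higher than the rejected one.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the preferred response's reward pulls ahead,
# and grows when the ranking is violated.
print(round(preference_loss(2.0, 0.0), 4))  # → 0.1269 (ranking respected)
print(round(preference_loss(0.0, 2.0), 4))  # → 2.1269 (ranking violated)
```

In practice the rewards come from a neural scoring head over full responses, but the same scalar loss is averaged over many comparison pairs.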

Goal

To ensure the AI follows instructions, avoids harmful content, and maintains a helpful persona.

Related Terms