Overview

Proximal Policy Optimization (PPO), developed by OpenAI in 2017, is known for being stable, reliable, and easier to tune than earlier policy-gradient methods. It is a default choice for many RL tasks today.

How it Works

PPO limits how far the policy can move away from its previous version in a single update. It does this with a 'clipped' surrogate objective: the probability ratio between the new and old policies is clipped to a small range around 1 (the 'proximal' part), so updates that would push the policy too far stop contributing to the gradient.
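The clipped objective can be sketched per sample as below; the function name is illustrative, and the epsilon default of 0.2 is the value reported in the original PPO paper:

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    ratio: pi_new(a|s) / pi_old(a|s), the probability ratio.
    advantage: estimated advantage of the sampled action.
    epsilon: clip range around 1; 0.2 is the paper's default.
    """
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # Taking the min makes this a pessimistic bound: the policy gains
    # nothing from pushing the ratio outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + epsilon.
print(ppo_clip_term(1.5, 1.0))   # 1.2, not 1.5
# Negative advantage: making a bad action *more* likely is never clipped away.
print(ppo_clip_term(1.5, -1.0))  # -1.5
```

Note the asymmetry: clipping only caps how much the objective can reward a large policy change, never how much it can penalize one.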

Use in LLMs

PPO is the core algorithm used in the reinforcement-learning stage of RLHF: the LLM acts as the policy and is fine-tuned against a learned reward model, which is how models like ChatGPT were made more helpful and safe.
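In common RLHF setups, the reward PPO optimizes is the reward-model score shaped by a KL penalty against a frozen reference model, which keeps generations close to the original distribution. The helper below is a hypothetical sketch of that shaped reward, not any particular library's API; the names and the 0.1 coefficient are illustrative:

```python
def shaped_rlhf_reward(rm_score, logprob_policy, logprob_ref, kl_coef=0.1):
    # Hypothetical sketch: penalize the reward-model score by a per-sample
    # KL estimate (log-prob difference) between the current policy and the
    # frozen reference model, discouraging drift from the original model.
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - kl_coef * kl_estimate

# If the policy assigns a higher log-prob than the reference does,
# the KL estimate is positive and the reward is reduced.
print(shaped_rlhf_reward(1.0, -2.0, -2.5))
```

PPO then maximizes this shaped reward exactly as it would any environment reward.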

Related Terms