Overview
Proximal Policy Optimization (PPO), introduced by OpenAI in 2017, is known for being stable, reliable, and easier to tune than predecessors such as TRPO. It is a default choice for many RL tasks today.
How it Works
PPO limits how much the policy can change in a single update. Its clipped objective function restricts the probability ratio between the new and old policies to a narrow range around 1, keeping each update "proximal" to the current policy and preventing destructively large steps.
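The clipping described above can be sketched directly. This is a minimal NumPy illustration of the clipped surrogate objective, not a full PPO implementation; the function name and the default epsilon of 0.2 (a commonly used value) are illustrative choices.

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate objective.

    ratio:     pi_new(a|s) / pi_old(a|s), the probability ratio
    advantage: estimated advantage A(s, a)
    epsilon:   clip range; the ratio is confined to [1-eps, 1+eps]
    """
    unclipped = ratio * advantage
    # Clip the ratio so the objective gives no extra reward for
    # moving the policy further than epsilon from the old policy.
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the minimum makes the bound pessimistic: large policy
    # changes are never rewarded, only penalized.
    return np.minimum(unclipped, clipped)
```

For example, with a positive advantage of 1.0 and a ratio of 1.5, the objective is capped at 1.2 rather than 1.5, so the gradient stops pushing the policy further once the ratio exceeds 1 + epsilon.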
Use in LLMs
PPO is the core optimization algorithm in the final, reinforcement-learning stage of RLHF, where models such as ChatGPT are fine-tuned against a learned reward model to be more helpful and safe.
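In PPO-based RLHF, the scalar reward fed to PPO typically combines the reward model's score with a KL penalty that keeps the fine-tuned policy close to the original (reference) model. The sketch below illustrates that combination under assumed inputs; the function name, argument names, and the 0.1 coefficient are all hypothetical.

```python
def rlhf_reward(reward_model_score, logprob_policy, logprob_ref, kl_coef=0.1):
    """Reward signal of the kind used in PPO-based RLHF (a sketch).

    reward_model_score: scalar score from the learned reward model
    logprob_policy:     log-probability of the response under the policy
    logprob_ref:        log-probability under the frozen reference model
    kl_coef:            weight of the KL penalty (illustrative value)
    """
    # Per-sample KL estimate: how far the policy has drifted from
    # the reference model on this response.
    kl_estimate = logprob_policy - logprob_ref
    # Penalizing the drift discourages reward hacking and keeps
    # generations fluent.
    return reward_model_score - kl_coef * kl_estimate
```

If the policy assigns a response log-probability of -2.0 where the reference model assigns -2.5, the drift estimate is 0.5, so a reward-model score of 1.0 is reduced to 0.95.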