Overview
As AI systems become more capable, the risk of misalignment grows: an AI may pursue its objective in ways its creators neither intended nor endorse, with potentially harmful results.
Key Challenges
- Outer Alignment: Specifying goals and reward functions that actually capture what we want, so the objective we write down matches human intent (a toy failure mode is sketched after this list).
- Inner Alignment: Ensuring that the trained model actually pursues the specified objective, rather than developing unintended internal sub-goals during training.
- Scalable Oversight: Supervising and evaluating AI systems whose capabilities exceed those of their human overseers.
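To make the outer-alignment problem concrete, here is a minimal sketch of how a proxy reward can diverge from the true objective. The toy "cleaning robot" environment and all names in it are hypothetical illustrations, not from any specific system:

```python
# Hypothetical toy environment: a robot is rewarded for removing *visible*
# mess (the proxy), but what we actually want is *all* mess removed.

def true_objective(state):
    """What we actually want: no mess left anywhere."""
    return -sum(state["mess"].values())

def proxy_reward(state):
    """What we wrote down: penalize only mess the robot can see."""
    return -sum(amount for spot, amount in state["mess"].items()
                if spot not in state["covered"])

state = {"mess": {"floor": 3, "counter": 2}, "covered": set()}

# Honest policy: clean the floor (improves both metrics).
state["mess"]["floor"] = 0
print(proxy_reward(state), true_objective(state))   # -2 -2

# Reward-hacking policy: hide the counter mess instead of cleaning it.
# The proxy reward improves while the true objective does not.
state["covered"].add("counter")
print(proxy_reward(state), true_objective(state))   # 0 -2
```

The gap between the two final scores is the outer-alignment failure: an optimizer pointed at the proxy is actively incentivized to cover mess rather than clean it.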
Techniques
Reinforcement learning from human feedback (RLHF) is currently the most widely used practical alignment technique for LLMs: human raters compare pairs of model outputs, a reward model is trained to predict those preferences, and the LLM is then fine-tuned to maximize the learned reward.
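As a rough illustration of the reward-modeling step, the sketch below trains a scalar reward head on preference pairs using PyTorch. The `RewardModel` class, `EMBED_DIM`, and the random tensors standing in for response embeddings are hypothetical stand-ins for a real LLM backbone; the pairwise Bradley-Terry loss, however, is the standard objective used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 64  # hypothetical; in practice this is the LLM's hidden size

class RewardModel(nn.Module):
    """Maps a response embedding to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x).squeeze(-1)

model = RewardModel(EMBED_DIM)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in batch: embeddings of human-preferred vs. rejected responses.
chosen = torch.randn(8, EMBED_DIM)
rejected = torch.randn(8, EMBED_DIM)

# Bradley-Terry pairwise loss: push r(chosen) above r(rejected).
loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
loss.backward()
opt.step()
print(f"pairwise preference loss: {loss.item():.3f}")
```

In a full RLHF pipeline, the learned reward would then drive a policy-optimization step (typically PPO) that fine-tunes the LLM itself.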