Overview

As AI systems become more capable, the risk of 'misalignment' grows: an AI may pursue its goals in ways that are harmful to, or unintended by, its creators.

Key Challenges

  • Outer Alignment: Defining the right goals and reward functions, so that the specified objective actually captures what the designers intend (a toy example of a misspecified reward follows this list).
  • Inner Alignment: Ensuring the AI doesn't develop its own unintended sub-goals or proxy objectives during training.
  • Scalable Oversight: Supervising AI systems whose capabilities exceed those of their human overseers.
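To make the outer-alignment problem concrete, here is a minimal, self-contained Python sketch of a misspecified reward. All names (EXIT, RESPAWNING_COIN, proxy_reward, and the two policies) are hypothetical and invented for this example, not drawn from any library: the intended goal is to reach the exit tile, but the proxy reward pays for collecting a respawning coin, so a proxy-maximizing policy farms the coin and never finishes the task.

    # Hypothetical 1-D track: the intended goal is to reach EXIT,
    # but the designer's proxy reward pays +1 per coin pickup.
    EXIT = 9
    RESPAWNING_COIN = 3  # a tile that pays out every time it is entered

    def proxy_reward(pos: int) -> float:
        """The designer's (flawed) proxy for task progress."""
        return 1.0 if pos == RESPAWNING_COIN else 0.0

    def episode(policy, steps: int = 20):
        """Roll out a policy and tally the proxy reward it earns."""
        pos, total = 0, 0.0
        for _ in range(steps):
            pos = policy(pos)
            total += proxy_reward(pos)
        return pos, total

    def coin_farmer(pos: int) -> int:
        """Proxy-optimal behavior: oscillate on and off the coin tile."""
        return pos + 1 if pos < RESPAWNING_COIN else pos - 1

    def goal_seeker(pos: int) -> int:
        """The intended behavior: head straight for the exit."""
        return min(EXIT, pos + 1)

    print(episode(coin_farmer))  # (2, 9.0): high reward, never reaches the exit
    print(episode(goal_seeker))  # (9, 1.0): reaches the exit, low proxy reward

The proxy-maximizing policy earns nine times the reward of the intended one while never completing the task; this gap between the specified objective and the intended objective is the outer-alignment problem.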

Techniques

Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used practical alignment technique for large language models (LLMs). Human raters compare pairs of model outputs, a reward model is trained to reproduce those preferences, and the policy is then fine-tuned with reinforcement learning against that reward model (a sketch of the reward-modeling step follows).
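As an illustration only, the sketch below implements the reward-modeling step in PyTorch using a Bradley-Terry pairwise loss over synthetic preference data. The RewardModel class, the fake embeddings, and the hyperparameters are all assumptions made for this example; a production pipeline would use a transformer scoring token sequences and a subsequent RL fine-tuning stage (e.g., PPO), both omitted here.

    # Minimal reward-model sketch for RLHF (all names hypothetical).
    import torch
    import torch.nn as nn

    class RewardModel(nn.Module):
        """Maps a fixed-size response embedding to a scalar reward."""
        def __init__(self, dim: int = 16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x).squeeze(-1)

    model = RewardModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Toy "preference data": embeddings of a human-preferred (chosen) and
    # a dispreferred (rejected) response to the same prompt. The +0.5
    # shift is a fake signal standing in for real human judgments.
    chosen = torch.randn(64, 16) + 0.5
    rejected = torch.randn(64, 16)

    for step in range(200):
        # Bradley-Terry pairwise loss: push r(chosen) above r(rejected).
        loss = -torch.nn.functional.logsigmoid(
            model(chosen) - model(rejected)
        ).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(f"final loss: {loss.item():.3f}")

Once trained, the reward model stands in for the human raters: during RL fine-tuning the policy is updated to produce outputs the reward model scores highly, usually with a KL penalty keeping it close to the pretrained model.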

Related Terms