Overview

As more content on the internet is generated by AI, future models will inevitably be trained on the outputs of their predecessors. Model collapse is the degenerative process in which the errors and biases of one generation are inherited and amplified by the next.

The Process

  1. Data Pollution: AI-generated data enters the training set.
  2. Loss of Tail Data: The model forgets rare, low-probability information (the 'edge cases') that was present in the original human data.
  3. Convergence: The model's outputs become increasingly bland, repetitive, and disconnected from reality.
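The loop above can be illustrated with a toy simulation. In this sketch (a deliberately simplified stand-in for real model training, not an implementation from the source), each "generation" fits a Gaussian to the previous generation's synthetic samples and then generates its own training data from that fit. Because each fit is made from a finite sample, the estimated variance is biased low, so the distribution narrows over generations: the tails disappear first, and the outputs converge. The function name and parameters are illustrative.

```python
import random
import statistics

def collapse_demo(generations=20, n_samples=50, seed=0):
    """Toy model-collapse loop: repeatedly refit a Gaussian to
    synthetic data sampled from the previous generation's fit.

    Returns the standard deviation observed at each generation,
    so the shrinking spread (loss of tail data) can be inspected.
    """
    rng = random.Random(seed)
    # Generation 0: the "human" data, drawn from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    stdevs = [statistics.pstdev(data)]
    for _ in range(generations):
        # Step 1-2: "train" on the current pool (fit mean and spread);
        # with finite samples, rare tail values are underrepresented.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        # Step 3: the next generation's training set is purely synthetic,
        # drawn from the fitted model rather than the original data.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        stdevs.append(statistics.pstdev(data))
    return stdevs

stdevs = collapse_demo()
```

Running this for enough generations shows the measured spread drifting toward zero: each refit tends to lose a little of the distribution's tails, and nothing in the loop ever reintroduces them.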

Significance

This poses a major challenge for the long-term scaling of AI and highlights the continued importance of curating high-quality, human-generated training data.