Overview

Knowledge distillation transfers the capabilities of a large 'teacher' model (such as GPT-4) into a much smaller, more efficient 'student' model that can run on a phone or laptop.

How it Works

The student model is trained not only on the final labels but also on the 'soft targets' (the full probability distribution over classes) produced by the teacher. These soft targets carry more information than a hard label: for example, if an image of a dog receives a small but non-zero probability for 'wolf', the student learns that those classes are related.
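
A minimal sketch of such a training objective, assuming a PyTorch setup: the loss blends a soft-target term (KL divergence between temperature-softened teacher and student distributions) with ordinary cross-entropy on the hard labels. The function name and the values of the temperature T and weight alpha are illustrative choices, not part of any particular library.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Illustrative distillation objective: soft-target KL term + hard-label cross-entropy."""
    # Soften both distributions with temperature T; higher T spreads probability
    # mass over more classes, exposing the teacher's inter-class relationships.
    soft_teacher = F.log_softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)

    # KL divergence between the softened distributions (log_target=True because
    # the teacher term above is also in log space).
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean", log_target=True)
    soft_loss = soft_loss * (T * T)  # rescale gradients as in Hinton et al., 2015

    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Typical use inside a training step (teacher outputs are detached so only
# the student receives gradients):
#   loss = distillation_loss(student(x), teacher(x).detach(), y)
```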

Applications

  • Creating fast, lightweight versions of BERT (e.g., DistilBERT).
  • Deploying high-quality AI to edge devices.

Related Terms