Overview

In an MoE (Mixture of Experts) model, only a small fraction of the total parameters are active for any given input. A router (a small learned gating network) scores the available experts (sub-networks) and sends each token to the few best suited to process it.
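The routing step described above can be sketched minimally as top-k gating: score all experts, keep the k highest, softmax their scores, and combine only those experts' outputs. All names, dimensions, and random weights below are illustrative placeholders, not taken from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# One tiny linear "expert" per slot (random weights stand in for trained ones).
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts))

def moe_forward(x):
    """Route a single token vector x through its top-k experts."""
    logits = x @ router_w                 # router scores, shape (n_experts,)
    top = np.argsort(logits)[-top_k:]     # indices of the k best-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over only the selected experts
    # Only the chosen experts run; the others are skipped entirely.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # (8,)
```

In a real transformer the experts are feed-forward blocks and routing happens independently at every MoE layer, but the select-then-combine pattern is the same.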

Benefits

  • Efficiency: Total parameter counts can be massive (e.g., hundreds of billions or more), but per-token compute scales only with the parameters actually activated, so inference cost resembles that of a much smaller dense model.
  • Performance: Different experts can specialize in different domains (e.g., math, coding, creative writing).
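The efficiency point can be made concrete with some back-of-the-envelope arithmetic. The numbers here are illustrative, chosen to echo an 8-expert, top-2 layout; they are not measurements from a real model.

```python
# Rough active-parameter arithmetic for a hypothetical MoE layer.
n_experts = 8            # experts available per MoE layer (assumed)
top_k = 2                # experts activated per token (assumed)
params_per_expert = 7e9  # parameters per expert, e.g. ~7B (assumed)

total_params = n_experts * params_per_expert
active_params = top_k * params_per_expert

print(f"total: {total_params / 1e9:.0f}B")            # total: 56B
print(f"active per token: {active_params / 1e9:.0f}B")  # active per token: 14B
print(f"active fraction: {active_params / total_params:.2f}")  # active fraction: 0.25
```

Per-token compute tracks the active count (here a quarter of the total), which is why an MoE can grow its total capacity without a proportional increase in inference cost.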

Notable Examples

  • Mixtral 8x7B
  • GPT-4 (widely rumored, but not confirmed, to be an MoE)

Related Terms