Overview
In an MoE (Mixture of Experts) model, only a small fraction of the total parameters is active for any given input. A learned router scores the experts (sub-networks), and each token is processed only by the top-k highest-scoring experts (often k = 2), whose outputs are combined by the router's weights.
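The routing step above can be sketched as follows. This is a minimal toy illustration, not any model's actual implementation: experts are plain linear maps, and the function name and shapes are hypothetical.

```python
import numpy as np

def moe_layer(x, experts, gate, top_k=2):
    # x: (d_model,) token embedding
    # experts: list of (d_model, d_model) matrices (toy linear experts)
    # gate: (d_model, n_experts) router projection
    logits = x @ gate                       # router score per expert
    chosen = np.argsort(logits)[-top_k:]    # indices of the top-k experts
    p = np.exp(logits[chosen] - logits[chosen].max())
    p /= p.sum()                            # softmax over the chosen experts only
    # Only the chosen experts run; their outputs are mixed by the router weights.
    return sum(w * (x @ experts[i]) for w, i in zip(p, chosen))

rng = np.random.default_rng(0)
d, n = 8, 4
x = rng.normal(size=d)
experts = [rng.normal(size=(d, d)) for _ in range(n)]
gate = rng.normal(size=(d, n))
y = moe_layer(x, experts, gate)  # only 2 of the 4 experts are evaluated
```

The key point the sketch shows: compute cost depends on `top_k`, not on the total number of experts.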
Benefits
- Efficiency: Total parameter counts can be massive (e.g., approaching a trillion), but per-token inference cost scales with the active parameters only, so the model runs at roughly the speed of a much smaller dense model.
- Performance: Different experts can specialize in different domains (e.g., math, coding, creative writing).
Notable Examples
- Mixtral 8x7B (8 experts per layer, 2 active per token)
- GPT-4 (rumored to be an MoE)