Overview
In an MoE (Mixture of Experts) model, only a small fraction of the total parameters is active for any given input. A learned router scores the experts (sub-networks), and each token is processed only by the top-k highest-scoring experts (often k = 2), whose outputs are combined by the router's weights.
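The routing step above can be sketched as follows. This is a minimal toy illustration, not any model's actual implementation: experts are plain linear maps, and the function name and shapes are hypothetical.

```python
import numpy as np

def moe_layer(x, experts, gate, top_k=2):
    # x: (d_model,) token embedding
    # experts: list of (d_model, d_model) matrices (toy linear experts)
    # gate: (d_model, n_experts) router projection
    logits = x @ gate                       # router score per expert
    chosen = np.argsort(logits)[-top_k:]    # indices of the top-k experts
    p = np.exp(logits[chosen] - logits[chosen].max())
    p /= p.sum()                            # softmax over the chosen experts only
    # Only the chosen experts run; their outputs are mixed by the router weights.
    return sum(w * (x @ experts[i]) for w, i in zip(p, chosen))

rng = np.random.default_rng(0)
d, n = 8, 4
x = rng.normal(size=d)
experts = [rng.normal(size=(d, d)) for _ in range(n)]
gate = rng.normal(size=(d, n))
y = moe_layer(x, experts, gate)  # only 2 of the 4 experts are evaluated
```

The key point the sketch shows: compute cost depends on `top_k`, not on the total number of experts.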
Benefits
- Efficiency: Total parameter counts can be massive (e.g., approaching a trillion), but per-token inference cost scales with the active parameters only, so the model runs at roughly the speed of a much smaller dense model.
- Performance: Different experts can specialize in different domains (e.g., math, coding, creative writing).
Notable Examples
- Mixtral 8x7B (8 experts per layer, 2 active per token)
- GPT-4 (rumored to be an MoE)