
Mixture of Experts (MoE)

A neural network architecture that activates different expert sub-networks depending on the input, increasing capacity without a matching increase in compute.

What Is Mixture of Experts?

Mixture of Experts (MoE) is a neural network architecture that divides a model into multiple specialized sub-networks called experts, along with a gating mechanism that routes each input to only a subset of these experts. This design allows models to have enormous total parameter counts while activating only a fraction of them for any given input, dramatically reducing computational cost, during both training and inference, compared to a dense model with the same total parameter count.

How MoE Works

In modern MoE language models, each Transformer layer contains multiple feed-forward expert networks. A learned router examines each token and selects the top-k experts (typically 2 out of 8 or more) to process it. The outputs from the selected experts are weighted by the router's scores and combined. This sparse activation means a model with hundreds of billions of total parameters may activate only a small fraction of them on each forward pass.
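
To make the routing step concrete, here is a minimal sketch of top-k routing for a single token. It uses toy NumPy matrices as "experts", and the names and sizes (d_model, n_experts, top_k, router_w) are purely illustrative, not taken from any particular model or library.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_experts, top_k = 16, 8, 2

    # Learned router weights and toy linear "experts" (illustrative only).
    router_w = rng.normal(size=(d_model, n_experts))
    experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def moe_forward(token):
        logits = token @ router_w             # one router score per expert
        top = np.argsort(logits)[-top_k:]     # indices of the top-k experts
        gates = softmax(logits[top])          # renormalise weights over the chosen experts
        # Only the selected experts run; the remaining n_experts - top_k are skipped.
        return sum(g * (token @ experts[i]) for g, i in zip(gates, top))

    out = moe_forward(rng.normal(size=d_model))
    print(out.shape)  # (16,)

In a real MoE Transformer layer the experts are full feed-forward blocks, the router is trained jointly with them, and an auxiliary load-balancing term is commonly added so that tokens are spread across experts rather than collapsing onto a few.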

Advantages of MoE Architecture

MoE models achieve performance comparable to dense models many times their active parameter count. They reach a given level of quality with less training compute than a comparable dense model, because only a few experts run per token and individual experts can specialize in different types of knowledge or tasks. The architecture also scales efficiently, as adding more experts increases model capacity without proportionally increasing inference cost.
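
A back-of-the-envelope sketch makes the gap between total and active parameters concrete. The numbers below are hypothetical, chosen only to illustrate the arithmetic for 8 experts with top-2 routing, not any specific model's configuration.

    # Hypothetical parameter counts, for illustration only.
    n_experts, top_k = 8, 2
    expert_params = 10e9   # parameters per expert, summed over all layers
    shared_params = 20e9   # attention, embeddings, routers, etc.

    total = shared_params + n_experts * expert_params   # 100B total parameters
    active = shared_params + top_k * expert_params      # 40B active per token
    print(f"total: {total/1e9:.0f}B, active per token: {active/1e9:.0f}B "
          f"({active/total:.0%} of total)")

Under these assumed numbers, each token is processed by only 40% of the model's parameters, which is why a very large MoE model can serve requests at a cost closer to that of a much smaller dense model.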