
Mixture of Experts (MoE)

Mixture of Experts activates only a subset of the model per input, trading a larger memory footprint for lower compute per token. It’s the architecture behind several of the most capable frontier models, and the reason model size and inference cost are no longer the same number.

What is Mixture of Experts?

Mixture of Experts (MoE) is a neural network architecture in which the model is divided into many specialized sub-networks called “experts,” and a learned routing mechanism determines which experts are activated for any given input. Rather than routing all inputs through all parameters — as a standard “dense” model does — an MoE model selects only a small subset of experts for each token or sequence, dramatically reducing the compute required per inference while maintaining a large total parameter count.

The insight behind MoE is that not all inputs require the same type of processing. Intuitively, a question about tax law could draw on different learned representations than a question about image composition. In practice, routing decisions are made per token and per layer, and the specializations that emerge are learned rather than assigned, so they do not always map onto human-recognizable domains or task types. Either way, only the selected subset of the model is engaged during inference.

MoE has become central to frontier AI development. Google has described Gemini 1.5 as a mixture-of-experts model. Mistral’s Mixtral models are MoE-based. OpenAI’s GPT-4 has been widely reported to use a mixture-of-experts design, though OpenAI has not confirmed its architecture. MoE is increasingly the default for models that need to be both large (for capability) and efficient (for practical deployment).

How MoE Works

An MoE layer replaces (or supplements) the standard feed-forward network in a transformer with two components:

  • Expert networks: A set of N feed-forward neural networks (typically 8–64 or more), each with the same architecture but different learned weights. Each expert develops different specializations through training.
  • Router (gating network): A small learned network that takes each token as input and outputs a probability distribution over the experts, selecting the top-K experts (typically 2) to process that token. The token’s output is a weighted combination of the selected experts’ outputs.
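The two components above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any particular model's implementation: the router here is assumed to be a single linear layer, the selected weights are renormalized to sum to 1 (one common convention), and the names `router_weights`, `route_top_k`, and `moe_layer` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def router_logits(token, router_weights):
    # Assumption: the router is one linear layer, one row of weights per expert.
    return [sum(w * x for w, x in zip(row, token)) for row in router_weights]

def route_top_k(token, router_weights, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits(token, router_weights))
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_weights, experts, k=2):
    """Weighted combination of the selected experts' outputs."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(token, router_weights, k):
        expert_out = experts[idx](token)  # only the selected experts run
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Toy experts: each one scales the token vector by a different factor.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.1, 0.0], [2.0, 0.0], [0.3, 0.0], [2.0, 0.0]]  # made-up values
token = [1.0, -1.0]

selected = route_top_k(token, router_weights, k=2)
output = moe_layer(token, router_weights, experts, k=2)
print(selected)  # experts 1 and 3, each with weight 0.5
print(output)    # [3.0, -3.0]
```

Experts 0 and 2 are never called for this token, which is where the compute saving comes from. In a real transformer the experts are feed-forward networks and routing is batched across many tokens at once.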

During training, the router and the experts are trained jointly — the experts learn to be useful for the tokens routed to them, and the router learns which experts handle which inputs best. A key engineering challenge is “load balancing”: ensuring that tokens are distributed roughly evenly across experts, rather than all routing to the same few experts and leaving most capacity idle. Various auxiliary loss terms are used during training to encourage balanced routing.
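As an illustration of what such an auxiliary term can look like, the sketch below follows the form popularized by the Switch Transformer work: the number of experts multiplied by the sum, over experts, of the fraction of tokens sent to that expert times the mean router probability it receives. The function name and the toy probabilities are assumptions for the example, and real training code would scale this value by a small coefficient before adding it to the main loss.

```python
def load_balance_loss(router_probs, num_experts):
    """router_probs: one probability distribution over experts per token."""
    num_tokens = len(router_probs)

    # f[i]: fraction of tokens whose highest-probability expert is i.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda i: probs[i])] += 1
    f = [c / num_tokens for c in counts]

    # p[i]: mean router probability assigned to expert i across tokens.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]

    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Four tokens spread evenly over four experts.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Four tokens that all prefer expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

balanced_loss = load_balance_loss(balanced, 4)    # about 1.0
collapsed_loss = load_balance_loss(collapsed, 4)  # about 2.8
print(balanced_loss, collapsed_loss)
```

The loss is lowest when tokens are spread evenly and grows as routing concentrates on a few experts, so minimizing it nudges the router toward balance.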

MoE vs Dense Models

The core trade-offs between MoE and dense models, which use every parameter for every token:

  • Parameters vs compute: A dense model with 70 billion parameters activates all 70B parameters for every input token. An MoE model with 140B total parameters might activate only 20B per token — maintaining the total capacity of a large model while using the compute of a smaller one.
  • Inference efficiency: MoE models can deliver dense-model-level performance at significantly lower inference cost. This is what makes them practical for large-scale deployment.
  • Memory requirements: The flip side is that all expert weights must be loaded into memory, even though only a fraction are used for any given token. An MoE model therefore needs more memory than a dense model with the same per-token compute.
  • Training stability: MoE models are harder to train. Routing instability — where the model collapses to using only a few experts — is a known failure mode that requires careful tuning to avoid.
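The parameter and memory trade-offs above can be made concrete with rough arithmetic on the 70B dense and 140B-total, 20B-active figures from the list. The sketch assumes 16-bit weights (2 bytes per parameter) and the common rule of thumb of about 2 floating-point operations per active parameter per token in a forward pass. It ignores attention over long contexts, activations, and the key-value cache, so treat the numbers as order-of-magnitude estimates only.

```python
BYTES_PER_PARAM = 2         # assumption: 16-bit weights
FLOPS_PER_ACTIVE_PARAM = 2  # rule of thumb: one multiply and one add per weight

def profile(total_params, active_params):
    """Rough weight memory and forward-pass compute per token."""
    return {
        "weight_memory_gb": total_params * BYTES_PER_PARAM / 1e9,
        "forward_gflops_per_token": active_params * FLOPS_PER_ACTIVE_PARAM / 1e9,
    }

dense = profile(total_params=70e9, active_params=70e9)
moe = profile(total_params=140e9, active_params=20e9)

print(dense)  # 140 GB of weights, 140 GFLOPs per token
print(moe)    # 280 GB of weights, 40 GFLOPs per token
```

Under these assumptions the MoE model needs twice the weight memory of the dense model but less than a third of its compute per token.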

Why MoE Matters for Scaling

Scaling-law research showed that model performance improves predictably as parameter count, training data, and compute grow together. MoE matters because it decouples two dimensions of scale that were previously coupled: the number of parameters (which determines capacity and capability) and the compute per token at inference (which determines cost and latency).

Before MoE at scale, increasing model capability necessarily increased inference cost proportionally. MoE breaks this constraint — a model can have the parameter count (and learned capacity) of a 1T-parameter model while using the compute of a 100B parameter model during inference. This is why the architecture has become essential to the economics of frontier AI: it enables providers to offer increasingly capable models without proportionally increasing inference costs.

For founders and operators evaluating AI models for their applications, MoE is mostly an implementation detail — you evaluate models by their benchmarks and practical performance, not their architecture. But understanding why capable models can be more affordable than their size would suggest helps set realistic expectations about the direction of AI pricing and capability over time.

Related Terms and Concepts

Scalability, SaaS, Disruption, Disruptive Technology