Mixture of Experts (MoE)
Overview
Mixture of Experts (MoE) dramatically scales model capacity through sparse activation without proportionally increasing computation. From the Switch Transformer to Mixtral, DeepSeek-MoE, and Llama 4, MoE has become the mainstream architecture for scaling large models.
```mermaid
graph TD
    A[MoE Architecture] --> B[Gating Network]
    A --> C[Expert Networks]
    A --> D[Load Balancing]
    B --> B1[Top-K Gating]
    B --> B2[Noisy Top-K]
    B --> B3[Expert Choice]
    C --> C1[Dense FFN]
    C --> C2[Shared + Routed Experts]
    D --> D1[Auxiliary Loss]
    D --> D2[Expert Parallelism]
    subgraph Models[Representative Models]
        E1[Switch Transformer]
        E2[Mixtral 8x7B]
        E3[DeepSeek-MoE]
        E4[Llama 4 MoE]
        E5[Jamba]
    end
```
1. MoE Fundamentals
1.1 Core Idea
In a standard Transformer, every token passes through all FFN parameters. MoE replaces the FFN with multiple "experts," where each token activates only a subset of experts.
Sparse Gating Function:
\[
G(x) = \mathrm{Softmax}\big(\mathrm{TopK}(x \cdot W_g)\big)
\]
MoE Layer Output:
\[
y = \sum_{i=1}^{N} G(x)_i \, E_i(x)
\]
where:
- \(N\): Total number of experts
- \(E_i\): The \(i\)-th expert network (typically an FFN)
- \(G(x)_i\): Weight assigned to the \(i\)-th expert by the gating network
- \(\mathrm{TopK}\): Keeps only the K largest logits and sets the rest to \(-\infty\), so experts outside the top K receive weight 0
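To make the formula concrete, here is a minimal PyTorch sketch of a top-K MoE layer; the function name, tensor shapes, and the per-expert dispatch loop are illustrative assumptions rather than any particular model's implementation.

```python
import torch

def moe_layer(x, experts, w_gate, k=2):
    """Sparse MoE forward pass: y = sum_i G(x)_i * E_i(x) with top-K gating.

    x: (n_tokens, d_model); experts: list of callables; w_gate: (d_model, n_experts).
    """
    logits = x @ w_gate                               # router logits, (n_tokens, n_experts)
    topv, topi = logits.topk(k, dim=-1)               # keep only the K best experts per token
    gates = torch.softmax(topv, dim=-1)               # softmax restricted to the kept experts
    y = torch.zeros_like(x)
    for slot in range(k):                             # dispatch loop, written for clarity not speed
        for i, expert in enumerate(experts):
            mask = topi[:, slot] == i                 # tokens whose slot-th choice is expert i
            if mask.any():
                y[mask] += gates[mask, slot, None] * expert(x[mask])
    return y
```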
1.2 Key Formulas
Noisy Top-K Gating (Shazeer et al., 2017):
\[
H(x)_i = (x \cdot W_g)_i + \epsilon_i \cdot \mathrm{Softplus}\big((x \cdot W_{\text{noise}})_i\big), \qquad \epsilon_i \sim \mathcal{N}(0, 1)
\]
\[
G(x) = \mathrm{Softmax}\big(\mathrm{KeepTopK}(H(x), K)\big)
\]
Adding input-dependent noise before the top-K selection helps with exploration and load balancing.
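A minimal PyTorch sketch of this gating rule follows; the shapes and function name are assumptions, while the softplus noise scaling mirrors the formula above.

```python
import torch

def noisy_topk_gates(x, w_gate, w_noise, k):
    """Sketch of noisy top-K gating in the style of Shazeer et al. (2017).

    x: (n_tokens, d_model); w_gate, w_noise: (d_model, n_experts).
    """
    clean = x @ w_gate                                            # clean router logits
    noise_std = torch.nn.functional.softplus(x @ w_noise)         # input-dependent noise scale
    h = clean + torch.randn_like(clean) * noise_std               # H(x)
    topv, topi = h.topk(k, dim=-1)                                # keep the K largest noisy logits
    masked = torch.full_like(h, float("-inf")).scatter(-1, topi, topv)
    return masked.softmax(dim=-1)                                 # experts outside the top-K get weight 0
```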
1.3 Computational Efficiency
| Model Type | Total Params | Active Params per Token | Compute |
|---|---|---|---|
| Dense 7B | 7B | 7B | 1x |
| MoE 8x7B (Top-2) | 47B | ~13B | ~2x Dense 7B |
| MoE 16x7B (Top-2) | 100B+ | ~13B | ~2x Dense 7B |
Core Advantage: Parameters scale dramatically, but per-token computation depends only on the number of activated experts.
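A back-of-the-envelope calculation makes the table concrete. The per-expert and shared parameter counts below are rough assumptions, loosely matching the 8x7B row above.

```python
# Hypothetical 8-expert, Top-2 MoE: total vs. active parameters per token.
n_experts     = 8
top_k         = 2
expert_params = 5.5e9   # assumed FFN parameters per expert (summed over layers)
shared_params = 1.5e9   # assumed attention/embedding parameters shared by all tokens

total_params  = shared_params + n_experts * expert_params   # ~45.5B stored
active_params = shared_params + top_k * expert_params       # ~12.5B used per token

print(f"total: {total_params / 1e9:.1f}B, active per token: {active_params / 1e9:.1f}B")
```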
2. Switch Transformer
2.1 Design
Switch Transformer (Fedus et al., 2022): Top-1 routing, each token selects only one expert.
Simplified Routing:
\[
p(x) = \mathrm{Softmax}(x \cdot W_r), \qquad i^{*} = \arg\max_i p_i(x), \qquad y = p_{i^{*}}(x)\, E_{i^{*}}(x)
\]
Design Principles:
- Top-1 routing is simpler and cheaper than Top-2
- A large number of experts (e.g., 128 or 256)
- Each expert's throughput is limited via an expert capacity factor
2.2 Capacity Factor
\[
\text{expert capacity} = \frac{n}{N} \times \text{capacity factor}
\]
where \(n\) is the number of tokens in the batch and \(N\) is the number of experts. Tokens routed to an expert that has already reached its capacity are dropped (overflow) and pass to the next layer only through the residual connection.
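For example, with an assumed batch of 4096 tokens, 128 experts, and a capacity factor of 1.25:

```python
# Illustrative capacity computation: a capacity factor of 1.25 gives each
# expert 25% headroom over a perfectly uniform token split.
n_tokens        = 4096
n_experts       = 128
capacity_factor = 1.25

expert_capacity = int(n_tokens / n_experts * capacity_factor)   # 40 tokens per expert
# A token routed to an expert that already holds `expert_capacity` tokens
# overflows: it skips the MoE layer and only its residual passes on.
print(expert_capacity)
```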
2.3 Load Balancing Loss
\[
\mathcal{L}_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i
\]
where:
- \(f_i\): Fraction of tokens routed to expert \(i\)
- \(P_i\): Average routing probability assigned to expert \(i\) by the gating network
- \(\alpha\): A small weighting coefficient (e.g., \(10^{-2}\))
This loss is minimized when both \(f_i\) and \(P_i\) are uniform at \(1/N\), so it encourages balanced routing.
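A minimal PyTorch sketch of this auxiliary loss; the tensor shapes and the default \(\alpha\) are assumptions.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_indices, n_experts, alpha=1e-2):
    """Switch-style auxiliary loss: alpha * N * sum_i f_i * P_i.

    router_probs:   (n_tokens, n_experts) softmax outputs of the gate
    expert_indices: (n_tokens,) expert chosen for each token
    """
    one_hot = F.one_hot(expert_indices, n_experts).float()
    f = one_hot.mean(dim=0)            # f_i: fraction of tokens dispatched to expert i
    P = router_probs.mean(dim=0)       # P_i: mean routing probability for expert i
    return alpha * n_experts * torch.sum(f * P)
```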
3. Mixtral
3.1 Mixtral 8x7B
Mixtral (Jiang et al., 2024): A milestone in open-source MoE.
Architecture:
- 8 experts, each approximately 7B parameters (the FFN portion)
- Top-2 routing
- Total ~47B parameters, ~13B active
- Performance matches or exceeds LLaMA 2 70B on most benchmarks
Routing Analysis:
- Different layers exhibit different routing patterns
- Some experts are more active for certain domains/languages
- However, individual experts do not show clear topic-level specialization
3.2 Mixtral 8x22B
- Larger experts (22B scale each)
- ~141B total parameters, ~39B active
- Further performance improvements
4. DeepSeek-MoE
4.1 Fine-Grained Experts
DeepSeek-MoE: splits each conventional expert into many smaller, fine-grained experts.
Architecture:
- 64 routed experts + 2 shared experts
- Top-6 routing
- Each expert is smaller (fine-grained)
Shared Experts:
Shared experts are always activated, capturing universal knowledge; routed experts handle specialized knowledge.
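A hedged PyTorch sketch of the shared-plus-routed structure is shown below; the layer sizes, gate form, and dispatch loop are illustrative, not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn

class SharedPlusRoutedMoE(nn.Module):
    """Shared experts always run; Top-K of the routed experts run per token."""

    def __init__(self, d_model=1024, d_ff=2048, n_shared=2, n_routed=64, top_k=6):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList([ffn() for _ in range(n_shared)])
        self.routed = nn.ModuleList([ffn() for _ in range(n_routed)])
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                               # x: (n_tokens, d_model)
        y = sum(e(x) for e in self.shared)               # shared experts: always active
        gates = self.gate(x).softmax(dim=-1)             # routing over routed experts only
        topv, topi = gates.topk(self.top_k, dim=-1)
        for slot in range(self.top_k):                   # sparse dispatch (loop for clarity)
            for i, expert in enumerate(self.routed):
                m = topi[:, slot] == i
                if m.any():
                    y[m] += topv[m, slot, None] * expert(x[m])
        return y
```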
4.2 DeepSeek-V2/V3
DeepSeek-V2:
- MLA (Multi-Head Latent Attention) reduces KV cache
- DeepSeekMoE architecture: 160 routed + 2 shared experts
- 21B active per token / 236B total parameters
DeepSeek-V3:
- 671B total parameters, 37B active per token
- 256 routed experts + 1 shared expert
- Auxiliary-loss-free load balancing strategy
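The auxiliary-loss-free strategy adjusts a per-expert bias that is added to the routing scores only for expert selection, not for the gating weights. The sketch below is a simplified rendering of that idea; the exact update rule and the step size `gamma` are assumptions.

```python
import torch

def route_with_bias(scores, bias, k):
    """Select experts with biased scores, but weight outputs with raw scores.

    scores: (n_tokens, n_experts) token-to-expert affinities; bias: (n_experts,).
    """
    _, topi = (scores + bias).topk(k, dim=-1)     # selection uses biased scores
    gates = torch.gather(scores, -1, topi)        # gating weights use the raw scores
    return gates, topi

def update_bias(bias, topi, n_experts, gamma=1e-3):
    """Nudge the bias down for overloaded experts and up for underloaded ones."""
    load = torch.bincount(topi.flatten(), minlength=n_experts).float()
    return bias - gamma * torch.sign(load - load.mean())
```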
5. Llama 4 MoE
5.1 Llama 4 Scout & Maverick
Llama 4 (Meta, 2025) adopts MoE architecture:
- Scout: 16 experts, 1 active per token (~109B total, 17B active parameters)
- Maverick: 128 experts, 1 active per token (~400B total, 17B active parameters)
Features:
- Interleaved attention layers (not every layer has MoE)
- iRoPE: Alternating attention layers with/without positional encoding
- Native multimodal: Supports images and text
6. Jamba: MoE + SSM
6.1 Hybrid Architecture
Jamba (AI21, 2024): Combines Transformer attention, Mamba SSM, and MoE.
Architecture:
```
Layer 1-2: Mamba + MoE
Layer 3:   Attention + MoE
Layer 4-5: Mamba + MoE
Layer 6:   Attention + MoE
...
```
Advantages:
- Mamba efficiently handles long sequences
- Attention maintains global modeling capability
- MoE expands model capacity
- 52B total parameters, 12B active, supports 256K context
7. Training Challenges
7.1 Load Imbalance
Problem: Some experts are selected too frequently ("hot" experts), while others are almost never chosen.
Solutions:
| Solution | Description |
|---|---|
| Auxiliary loss | Add balancing loss to encourage uniform allocation |
| Expert Choice | Experts choose tokens rather than tokens choosing experts (sketched after this table) |
| Capacity limits | Limit tokens processed per expert |
| Random routing | Add randomness |
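As an illustration of the Expert Choice row, here is a hedged sketch in which each expert selects its own top tokens; the normalization axis and the fixed per-expert capacity are assumptions.

```python
import torch

def expert_choice_routing(scores, capacity):
    """Each expert picks its own top-`capacity` tokens, so load is balanced by construction.

    scores: (n_tokens, n_experts) raw token-to-expert affinities.
    """
    probs = scores.softmax(dim=-1)               # affinity of each token to each expert
    topv, topi = probs.topk(capacity, dim=0)     # per expert, choose its `capacity` best tokens
    # topi[c, e]: index of the c-th token chosen by expert e; topv[c, e]: its gate weight
    return topv, topi
```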
7.2 Expert Collapse
Problem: Some experts gradually degrade and stop being selected, forming "dead experts."
Mitigation:
- Ensure expert diversity at initialization
- Periodically reinitialize cold experts
- Use stronger regularization
- DeepSeek-V3's auxiliary-loss-free load balancing
7.3 Communication Overhead
Expert Parallelism: Different experts distributed across different GPUs.
- Each MoE layer requires AllToAll communication (dispatching tokens to corresponding expert GPUs)
- Communication volume scales with number of active experts and tokens
- Requires high-speed interconnects (NVLink, InfiniBand)
```mermaid
graph LR
    subgraph GPU0
        T1[Tokens] --> E1[Expert 0]
        T1 --> E2[Expert 1]
    end
    subgraph GPU1
        T2[Tokens] --> E3[Expert 2]
        T2 --> E4[Expert 3]
    end
    T1 -.->|AllToAll| E3
    T1 -.->|AllToAll| E4
    T2 -.->|AllToAll| E1
    T2 -.->|AllToAll| E2
```
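The dispatch step can be sketched locally: tokens are bucketed by destination expert, and those buckets are what the AllToAll collective exchanges between GPUs. The names and shapes below are illustrative; only the grouping logic is shown, not the actual collective call.

```python
import torch

def dispatch_by_expert(x, expert_ids, n_experts):
    """Group tokens by destination expert before the AllToAll exchange.

    x: (n_tokens, d_model); expert_ids: (n_tokens,) chosen expert per token.
    """
    order = torch.argsort(expert_ids)                            # contiguous bucket per expert
    counts = torch.bincount(expert_ids, minlength=n_experts)     # split sizes for the AllToAll
    return x[order], order, counts

# In a real expert-parallel setup, x[order] and counts feed a collective such as
# torch.distributed.all_to_all_single; a second AllToAll returns the expert
# outputs, which are scattered back into the original token order via `order`.
```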
8. MoE vs Dense Models
| Dimension | Dense | MoE |
|---|---|---|
| Parameter efficiency | Low (all activated) | High (sparse activation) |
| Training FLOPs | High | Lower (at same performance) |
| Inference speed | Stable | Affected by routing and communication |
| Memory requirement | \(\Phi\) (all dense parameters) | \(N \times \Phi_e\) plus shared parameters (all experts must be stored even though few are active) |
| Fine-tuning | Simple | Complex (routing stability) |
| Training stability | Good | Requires extra techniques |
| Scalability | Linear | Sub-linear compute growth |
9. Summary
| Model | Year | Experts | TopK | Total Params | Active Params |
|---|---|---|---|---|---|
| Switch Transformer | 2022 | 128 | 1 | - | - |
| Mixtral 8x7B | 2024 | 8 | 2 | 47B | 13B |
| DeepSeek-V2 | 2024 | 160+2 | 6 | 236B | 21B |
| DeepSeek-V3 | 2024 | 256+1 | 8 | 671B | 37B |
| Llama 4 Scout | 2025 | 16 | 1 | - | 17B |
| Jamba | 2024 | 16 | 2 | 52B | 12B |
Key Trends:
- From coarse-grained to fine-grained experts
- Hybrid design with shared + routed experts
- Fusion of MoE with SSM (Jamba)
- Smarter load balancing strategies
- Native multimodal MoE
References
- Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer," ICLR 2017
- Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity," JMLR 2022
- Jiang et al., "Mixtral of Experts," 2024
- Dai et al., "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models," ACL 2024
- Lieber et al., "Jamba: A Hybrid Transformer-Mamba Language Model," 2024