
Mixture of Experts (MoE)

Overview

Mixture of Experts (MoE) dramatically scales model capacity through sparse activation without proportionally increasing computation. From the Switch Transformer to Mixtral, DeepSeek-MoE, and Llama 4, MoE has become the mainstream architecture for scaling large models.

graph TD
    A[MoE Architecture] --> B[Gating Network]
    A --> C[Expert Networks]
    A --> D[Load Balancing]

    B --> B1[Top-K Gating]
    B --> B2[Noisy Top-K]
    B --> B3[Expert Choice]

    C --> C1[Dense FFN]
    C --> C2[Shared + Routed Experts]

    D --> D1[Auxiliary Loss]
    D --> D2[Expert Parallelism]

    subgraph Representative Models
    E1[Switch Transformer]
    E2[Mixtral 8x7B]
    E3[DeepSeek-MoE]
    E4[Llama 4 MoE]
    E5[Jamba]
    end

1. MoE Fundamentals

1.1 Core Idea

In a standard Transformer, every token passes through all FFN parameters. MoE replaces the FFN with multiple "experts," where each token activates only a subset of experts.

Sparse Gating Function:

\[ G(x) = \text{TopK}(\text{softmax}(W_g x)) \]

MoE Layer Output:

\[ y = \sum_{i=1}^{N} G(x)_i \cdot E_i(x) \]

where:

  • \(N\): Total number of experts
  • \(E_i\): The \(i\)-th expert network (typically an FFN)
  • \(G(x)_i\): Weight assigned to the \(i\)-th expert by the gating network
  • TopK: Retains only the top K values, setting the rest to 0
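The two formulas above can be sketched in a few lines of numpy. This is an illustrative, single-token sketch (the function names `topk_gating` and `moe_layer` are ours, not from any library); it keeps the un-renormalized Top-K weights exactly as the formula states, and only the experts with nonzero gates are actually evaluated:

```python
import numpy as np

def topk_gating(x, W_g, k):
    """G(x) = TopK(softmax(W_g x)): keep the k largest gate values, zero the rest."""
    logits = W_g @ x                      # one logit per expert, shape (N,)
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()           # softmax
    topk = np.argsort(probs)[-k:]         # indices of the k largest weights
    gates = np.zeros_like(probs)
    gates[topk] = probs[topk]
    return gates

def moe_layer(x, W_g, experts, k=2):
    """y = sum_i G(x)_i * E_i(x); inactive experts (gate 0) are skipped entirely."""
    gates = topk_gating(x, W_g, k)
    y = np.zeros_like(x)
    for i, g in enumerate(gates):
        if g > 0:                         # sparsity: only k experts run
            y = y + g * experts[i](x)
    return y
```

Note that compute scales with `k`, not with the total number of experts `N`, which is the whole point of sparse activation.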

1.2 Key Formulas

Noisy Top-K Gating (Shazeer et al., 2017):

\[ H(x) = W_g x + \text{StandardNormal}() \cdot \text{softplus}(W_{\text{noise}} x) \]
\[ G(x) = \text{softmax}(\text{TopK}(H(x), k)) \]

Adding noise helps with exploration and load balancing.
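A minimal numpy sketch of the noisy logits \(H(x)\), assuming per-expert gating and noise weight matrices (the function names are hypothetical; a real implementation disables the noise at inference time):

```python
import numpy as np

def softplus(z):
    # numerically stable log(1 + exp(z))
    return np.logaddexp(0.0, z)

def noisy_topk_logits(x, W_g, W_noise, rng):
    """H(x) = W_g x + eps * softplus(W_noise x), with eps ~ N(0, 1).
    The learned softplus term gives each expert its own (positive) noise scale."""
    clean = W_g @ x
    noise_scale = softplus(W_noise @ x)
    return clean + rng.standard_normal(clean.shape) * noise_scale
```

Top-K selection followed by softmax over the kept logits then yields \(G(x)\) as in the formula above.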

1.3 Computational Efficiency

| Model Type | Total Params | Active Params per Token | Compute |
| --- | --- | --- | --- |
| Dense 7B | 7B | 7B | 1x |
| MoE 8x7B (Top-2) | 47B | ~13B | ~2x Dense 7B |
| MoE 16x7B (Top-2) | 100B+ | ~13B | ~2x Dense 7B |

Core Advantage: Parameters scale dramatically, but per-token computation depends only on the number of activated experts.
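The table's 8x7B (Top-2) row implies a simple parameter split, which we can solve for directly: 47B = shared + 8·expert and 13B = shared + 2·expert gives roughly 5.7B per expert FFN and 1.7B of always-active shared parameters (attention, embeddings). These are back-of-envelope figures derived from the table, not official model internals:

```python
def moe_param_counts(n_experts, top_k, expert_params, shared_params):
    """Total vs. per-token active parameter counts for a sparse MoE.
    shared_params = always-active parameters (attention, embeddings, etc.)."""
    total = shared_params + n_experts * expert_params
    active = shared_params + top_k * expert_params
    return total, active

# Split implied by the 8x7B (Top-2) row: 47B total, 13B active
expert = (47e9 - 13e9) / 6    # ~5.7B per expert FFN
shared = 13e9 - 2 * expert    # ~1.7B shared
```

Doubling the expert count (the 16-expert row) roughly doubles total parameters while leaving the ~13B active count untouched.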


2. Switch Transformer

2.1 Design

Switch Transformer (Fedus et al., 2022): Top-1 routing, each token selects only one expert.

Simplified Routing:

\[ G(x) = \text{Top1}(\text{softmax}(W_g x)) \]

Design Principles:

  • Top-1 is simpler and more efficient than Top-2
  • Large number of experts (128, 256)
  • Limited expert capacity (expert capacity factor)

2.2 Capacity Factor

\[ \text{Expert Capacity} = \left(\frac{n}{N}\right) \times \text{capacity\_factor} \]

where \(n\) is the number of tokens and \(N\) is the number of experts. Tokens exceeding capacity are dropped (overflow).
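The capacity formula is direct to compute; a small sketch (the Switch Transformer paper discusses capacity factors in the 1.0-1.25 range):

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    """Maximum tokens each expert may process per batch.
    Tokens routed beyond this limit overflow and are dropped."""
    return math.ceil(num_tokens / num_experts * capacity_factor)
```

For example, 1024 tokens split across 8 experts with a factor of 1.25 gives each expert room for 160 tokens.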

2.3 Load Balancing Loss

\[ \mathcal{L}_{\text{balance}} = N \sum_{i=1}^{N} f_i \cdot P_i \]

where:

  • \(f_i\): Fraction of tokens routed to expert \(i\)
  • \(P_i\): Average probability assigned to expert \(i\) by the gating network

This loss encourages both \(f_i\) and \(P_i\) to approach uniform \(1/N\).
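The loss is cheap to compute from a batch of routing decisions. A numpy sketch (function name ours): with perfectly uniform routing, \(f_i = P_i = 1/N\) and the loss reaches its minimum of 1, while any skew toward hot experts pushes it higher:

```python
import numpy as np

def load_balance_loss(router_probs, expert_index):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i.
    router_probs: (T, N) softmax outputs per token.
    expert_index: (T,) the expert each token was routed to."""
    T, N = router_probs.shape
    f = np.bincount(expert_index, minlength=N) / T  # fraction routed to each expert
    P = router_probs.mean(axis=0)                   # mean router prob per expert
    return N * float(np.sum(f * P))
```

Because \(f_i\) is non-differentiable, the gradient flows through \(P_i\), nudging the router's probabilities toward the uniform distribution.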


3. Mixtral

3.1 Mixtral 8x7B

Mixtral (Jiang et al., 2024): A milestone in open-source MoE.

Architecture:

  • 8 experts, each approximately 7B parameters (the FFN portion)
  • Top-2 routing
  • Total ~47B parameters, ~13B active
  • Performance matches LLaMA 2 70B

Routing Analysis:

  • Different layers exhibit different routing patterns
  • Some experts are more active for certain domains/languages
  • However, experts do not exhibit clear topic-level specialization

3.2 Mixtral 8x22B

  • Larger experts (22B scale)
  • Total ~176B parameters
  • Further performance improvements

4. DeepSeek-MoE

4.1 Fine-Grained Experts

DeepSeek-MoE: Further subdivides experts.

Architecture:

  • 64 routed experts + 2 shared experts
  • Top-6 routing
  • Each expert is smaller (fine-grained)

Shared Experts:

\[ y = \sum_{i=1}^{N_s} E_i^{\text{shared}}(x) + \sum_{j \in \text{TopK}} G(x)_j \cdot E_j^{\text{routed}}(x) \]

Shared experts are always activated, capturing universal knowledge; routed experts handle specialized knowledge.
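The shared + routed formula above maps directly to code. A single-token numpy sketch (names hypothetical): the shared experts run unconditionally, while the gate only routes over the routed experts:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def shared_plus_routed(x, shared_experts, routed_experts, W_g, k):
    """y = sum_s E_s^shared(x) + sum_{j in TopK} G(x)_j * E_j^routed(x)."""
    y = sum(E(x) for E in shared_experts)  # shared experts: always active
    probs = softmax(W_g @ x)               # routing over routed experts only
    for j in np.argsort(probs)[-k:]:       # top-k routed experts
        y = y + probs[j] * routed_experts[j](x)
    return y
```

Splitting the computation this way frees the routed experts from re-learning universal features that every token needs.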

4.2 DeepSeek-V2/V3

DeepSeek-V2:

  • MLA (Multi-Head Latent Attention) reduces KV cache
  • DeepSeekMoE architecture: 160 routed + 2 shared experts
  • 21B active per token / 236B total parameters

DeepSeek-V3:

  • 671B total parameters, 37B active per token
  • 256 routed experts + 1 shared expert
  • Auxiliary-loss-free load balancing strategy

5. Llama 4 MoE

5.1 Llama 4 Scout & Maverick

Llama 4 (Meta, 2025) adopts MoE architecture:

  • Scout: 16 experts, 17B active parameters
  • Maverick: 128 experts, 17B active parameters
  • In both, each token activates one routed expert plus a shared expert

Features:

  • Interleaved dense and MoE layers (not every layer has MoE)
  • iRoPE: Alternating attention layers with/without positional encoding
  • Native multimodal: Supports images and text

6. Jamba: MoE + SSM

6.1 Hybrid Architecture

Jamba (AI21, 2024): Combines Transformer attention, Mamba SSM, and MoE.

Architecture:

Layer 1-2: Mamba + MoE
Layer 3:   Attention + MoE  
Layer 4-5: Mamba + MoE
Layer 6:   Attention + MoE
...

Advantages:

  • Mamba efficiently handles long sequences
  • Attention maintains global modeling capability
  • MoE expands model capacity
  • 52B total parameters, 12B active, supports 256K context

7. Training Challenges

7.1 Load Imbalance

Problem: Some experts are selected too frequently (hot experts), others almost never.

Solutions:

| Solution | Description |
| --- | --- |
| Auxiliary loss | Add a balancing loss to encourage uniform allocation |
| Expert Choice | Experts choose tokens rather than tokens choosing experts |
| Capacity limits | Cap the number of tokens processed per expert |
| Random routing | Add randomness to routing decisions |
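The Expert Choice solution inverts the selection: instead of each token picking its top experts (which lets hot experts overflow), each expert picks its top tokens, so load is balanced by construction. A numpy sketch in the spirit of Zhou et al. (2022); note a token may be picked by several experts or by none:

```python
import numpy as np

def expert_choice_routing(scores, capacity):
    """Each expert selects its top-`capacity` tokens by affinity score.
    scores: (T, N) token-expert affinities.
    Returns a dict mapping expert index -> array of selected token indices."""
    assignments = {}
    for e in range(scores.shape[1]):
        assignments[e] = np.argsort(scores[:, e])[-capacity:]
    return assignments
```

Every expert processes exactly `capacity` tokens, so no auxiliary balancing loss is needed.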

7.2 Expert Collapse

Problem: Some experts gradually degrade and stop being selected, forming "dead experts."

Mitigation:

  • Ensure expert diversity at initialization
  • Periodically reinitialize cold experts
  • Use stronger regularization
  • DeepSeek-V3's auxiliary-loss-free load balancing

7.3 Communication Overhead

Expert Parallelism: Different experts distributed across different GPUs.

  • Each MoE layer requires AllToAll communication (dispatching tokens to corresponding expert GPUs)
  • Communication volume scales with number of active experts and tokens
  • Requires high-speed interconnects (NVLink, InfiniBand)

graph LR
    subgraph GPU0
    T1[Tokens] --> E1[Expert 0]
    T1 --> E2[Expert 1]
    end

    subgraph GPU1
    T2[Tokens] --> E3[Expert 2]
    T2 --> E4[Expert 3]
    end

    T1 -.->|AllToAll| E3
    T1 -.->|AllToAll| E4
    T2 -.->|AllToAll| E1
    T2 -.->|AllToAll| E2

8. MoE vs Dense Models

| Dimension | Dense | MoE |
| --- | --- | --- |
| Parameter efficiency | Low (all activated) | High (sparse activation) |
| Training FLOPs | High | Lower (at same performance) |
| Inference speed | Stable | Affected by routing and communication |
| Memory requirement | \(\Phi\) | \(N \times \Phi_e\) (must store all experts) |
| Fine-tuning | Simple | Complex (routing stability) |
| Training stability | Good | Requires extra techniques |
| Scalability | Linear | Sub-linear compute growth |

9. Summary

| Model | Year | Experts | TopK | Total Params | Active Params |
| --- | --- | --- | --- | --- | --- |
| Switch Transformer | 2022 | 128 | 1 | - | - |
| Mixtral 8x7B | 2024 | 8 | 2 | 47B | 13B |
| DeepSeek-V2 | 2024 | 160+2 | 6 | 236B | 21B |
| DeepSeek-V3 | 2024 | 256+1 | 8 | 671B | 37B |
| Llama 4 Scout | 2025 | 16 | 1 | - | 17B |
| Jamba | 2024 | 16 | 2 | 52B | 12B |

Key Trends:

  1. From coarse-grained to fine-grained experts
  2. Hybrid design with shared + routed experts
  3. Fusion of MoE with SSM (Jamba)
  4. Smarter load balancing strategies
  5. Native multimodal MoE

References

  • Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer," ICLR 2017
  • Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity," JMLR 2022
  • Jiang et al., "Mixtral of Experts," 2024
  • Dai et al., "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models," ACL 2024
  • Lieber et al., "Jamba: A Hybrid Transformer-Mamba Language Model," 2024
