
Mixture of Experts (MoE)

Overview

Mixture of Experts (MoE) dramatically scales model capacity through sparse activation without proportionally increasing computation. From the Switch Transformer to Mixtral, DeepSeek-MoE, and Llama 4, MoE has become the mainstream architecture for scaling large models.

graph TD
    A[MoE Architecture] --> B[Gating Network]
    A --> C[Expert Networks]
    A --> D[Load Balancing]

    B --> B1[Top-K Gating]
    B --> B2[Noisy Top-K]
    B --> B3[Expert Choice]

    C --> C1[Dense FFN]
    C --> C2[Shared + Routed Experts]

    D --> D1[Auxiliary Loss]
    D --> D2[Expert Parallelism]

    subgraph Representative Models
    E1[Switch Transformer]
    E2[Mixtral 8x7B]
    E3[DeepSeek-MoE]
    E4[Llama 4 MoE]
    E5[Jamba]
    end

1. MoE Fundamentals

1.1 Core Idea

In a standard Transformer, every token passes through all FFN parameters. MoE replaces the FFN with multiple "experts," where each token activates only a subset of experts.

Sparse Gating Function:

\[ G(x) = \text{TopK}(\text{softmax}(W_g x)) \]

MoE Layer Output:

\[ y = \sum_{i=1}^{N} G(x)_i \cdot E_i(x) \]

where:

  • \(N\): Total number of experts
  • \(E_i\): The \(i\)-th expert network (typically an FFN)
  • \(G(x)_i\): Weight assigned to the \(i\)-th expert by the gating network
  • TopK: Retains only the top K values, setting the rest to 0
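The two formulas above can be sketched in a few lines of numpy. This is an illustrative, single-token sketch (the function names `topk_gating` and `moe_layer` are ours, not from any library); it keeps the un-renormalized Top-K weights exactly as the formula states, and only the experts with nonzero gates are actually evaluated:

```python
import numpy as np

def topk_gating(x, W_g, k):
    """G(x) = TopK(softmax(W_g x)): keep the k largest gate values, zero the rest."""
    logits = W_g @ x                      # one logit per expert, shape (N,)
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()           # softmax
    topk = np.argsort(probs)[-k:]         # indices of the k largest weights
    gates = np.zeros_like(probs)
    gates[topk] = probs[topk]
    return gates

def moe_layer(x, W_g, experts, k=2):
    """y = sum_i G(x)_i * E_i(x); inactive experts (gate 0) are skipped entirely."""
    gates = topk_gating(x, W_g, k)
    y = np.zeros_like(x)
    for i, g in enumerate(gates):
        if g > 0:                         # sparsity: only k experts run
            y = y + g * experts[i](x)
    return y
```

Note that compute scales with `k`, not with the total number of experts `N`, which is the whole point of sparse activation.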

1.2 Key Formulas

Noisy Top-K Gating (Shazeer et al., 2017):

\[ H(x) = W_g x + \text{StandardNormal}() \cdot \text{softplus}(W_{\text{noise}} x) \]
\[ G(x) = \text{softmax}(\text{TopK}(H(x), k)) \]

Adding noise helps with exploration and load balancing.
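A minimal numpy sketch of the noisy logits \(H(x)\), assuming per-expert gating and noise weight matrices (the function names are hypothetical; a real implementation disables the noise at inference time):

```python
import numpy as np

def softplus(z):
    # numerically stable log(1 + exp(z))
    return np.logaddexp(0.0, z)

def noisy_topk_logits(x, W_g, W_noise, rng):
    """H(x) = W_g x + eps * softplus(W_noise x), with eps ~ N(0, 1).
    The learned softplus term gives each expert its own (positive) noise scale."""
    clean = W_g @ x
    noise_scale = softplus(W_noise @ x)
    return clean + rng.standard_normal(clean.shape) * noise_scale
```

Top-K selection followed by softmax over the kept logits then yields \(G(x)\) as in the formula above.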

1.3 Computational Efficiency

| Model Type | Total Params | Active Params per Token | Compute |
| --- | --- | --- | --- |
| Dense 7B | 7B | 7B | 1x |
| MoE 8x7B (Top-2) | 47B | ~13B | ~2x Dense 7B |
| MoE 16x7B (Top-2) | 100B+ | ~13B | ~2x Dense 7B |

Core Advantage: Parameters scale dramatically, but per-token computation depends only on the number of activated experts.
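The table's 8x7B (Top-2) row implies a simple parameter split, which we can solve for directly: 47B = shared + 8·expert and 13B = shared + 2·expert gives roughly 5.7B per expert FFN and 1.7B of always-active shared parameters (attention, embeddings). These are back-of-envelope figures derived from the table, not official model internals:

```python
def moe_param_counts(n_experts, top_k, expert_params, shared_params):
    """Total vs. per-token active parameter counts for a sparse MoE.
    shared_params = always-active parameters (attention, embeddings, etc.)."""
    total = shared_params + n_experts * expert_params
    active = shared_params + top_k * expert_params
    return total, active

# Split implied by the 8x7B (Top-2) row: 47B total, 13B active
expert = (47e9 - 13e9) / 6    # ~5.7B per expert FFN
shared = 13e9 - 2 * expert    # ~1.7B shared
```

Doubling the expert count (the 16-expert row) roughly doubles total parameters while leaving the ~13B active count untouched.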


2. Switch Transformer

2.1 Design

Switch Transformer (Fedus et al., 2022): Top-1 routing, each token selects only one expert.

Simplified Routing:

\[ G(x) = \text{Top1}(\text{softmax}(W_g x)) \]

Design Principles:

  • Top-1 is simpler and more efficient than Top-2
  • Large number of experts (128, 256)
  • Limited expert capacity (expert capacity factor)

2.2 Capacity Factor

\[ \text{Expert Capacity} = \left(\frac{n}{N}\right) \times \text{capacity\_factor} \]

where \(n\) is the number of tokens and \(N\) is the number of experts. Tokens exceeding capacity are dropped (overflow).
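The capacity formula is direct to compute; a small sketch (the Switch Transformer paper discusses capacity factors in the 1.0-1.25 range):

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    """Maximum tokens each expert may process per batch.
    Tokens routed beyond this limit overflow and are dropped."""
    return math.ceil(num_tokens / num_experts * capacity_factor)
```

For example, 1024 tokens split across 8 experts with a factor of 1.25 gives each expert room for 160 tokens.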

2.3 Load Balancing Loss

\[ \mathcal{L}_{\text{balance}} = N \sum_{i=1}^{N} f_i \cdot P_i \]

where:

  • \(f_i\): Fraction of tokens routed to expert \(i\)
  • \(P_i\): Average probability assigned to expert \(i\) by the gating network

This loss encourages both \(f_i\) and \(P_i\) to approach uniform \(1/N\).
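The loss is cheap to compute from a batch of routing decisions. A numpy sketch (function name ours): with perfectly uniform routing, \(f_i = P_i = 1/N\) and the loss reaches its minimum of 1, while any skew toward hot experts pushes it higher:

```python
import numpy as np

def load_balance_loss(router_probs, expert_index):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i.
    router_probs: (T, N) softmax outputs per token.
    expert_index: (T,) the expert each token was routed to."""
    T, N = router_probs.shape
    f = np.bincount(expert_index, minlength=N) / T  # fraction routed to each expert
    P = router_probs.mean(axis=0)                   # mean router prob per expert
    return N * float(np.sum(f * P))
```

Because \(f_i\) is non-differentiable, the gradient flows through \(P_i\), nudging the router's probabilities toward the uniform distribution.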


3. Mixtral

3.1 Mixtral 8x7B

Mixtral (Jiang et al., 2024): A milestone in open-source MoE.

Architecture:

  • 8 experts, each approximately 7B parameters (the FFN portion)
  • Top-2 routing
  • Total ~47B parameters, ~13B active
  • Performance matches LLaMA 2 70B

Routing Analysis:

  • Different layers exhibit different routing patterns
  • Some experts are more active for certain domains/languages
  • However, experts do not exhibit clear topic-level specialization

3.2 Mixtral 8x22B

  • Larger experts (22B scale)
  • Total ~176B parameters
  • Further performance improvements

4. DeepSeek-MoE

4.1 Fine-Grained Experts

DeepSeek-MoE: Further subdivides experts.

Architecture:

  • 64 routed experts + 2 shared experts
  • Top-6 routing
  • Each expert is smaller (fine-grained)

Shared Experts:

\[ y = \sum_{i=1}^{N_s} E_i^{\text{shared}}(x) + \sum_{j \in \text{TopK}} G(x)_j \cdot E_j^{\text{routed}}(x) \]

Shared experts are always activated, capturing universal knowledge; routed experts handle specialized knowledge.
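The shared + routed formula above maps directly to code. A single-token numpy sketch (names hypothetical): the shared experts run unconditionally, while the gate only routes over the routed experts:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def shared_plus_routed(x, shared_experts, routed_experts, W_g, k):
    """y = sum_s E_s^shared(x) + sum_{j in TopK} G(x)_j * E_j^routed(x)."""
    y = sum(E(x) for E in shared_experts)  # shared experts: always active
    probs = softmax(W_g @ x)               # routing over routed experts only
    for j in np.argsort(probs)[-k:]:       # top-k routed experts
        y = y + probs[j] * routed_experts[j](x)
    return y
```

Splitting the computation this way frees the routed experts from re-learning universal features that every token needs.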

4.2 DeepSeek-V2/V3

DeepSeek-V2:

  • MLA (Multi-Head Latent Attention) reduces KV cache
  • DeepSeekMoE architecture: 160 routed + 2 shared experts
  • 21B active per token / 236B total parameters

DeepSeek-V3:

  • 671B total parameters, 37B active per token
  • 256 routed experts + 1 shared expert
  • Auxiliary-loss-free load balancing strategy

5. Llama 4 MoE

5.1 Llama 4 Scout & Maverick

Llama 4 (Meta, 2025) adopts MoE architecture:

  • Scout: 16 experts, 17B active parameters
  • Maverick: 128 experts, 17B active parameters
  • In both, each token activates one routed expert plus a shared expert

Features:

  • Interleaved dense and MoE layers (not every layer has MoE)
  • iRoPE: Alternating attention layers with/without positional encoding
  • Native multimodal: Supports images and text

6. Jamba: MoE + SSM

6.1 Hybrid Architecture

Jamba (AI21, 2024): Combines Transformer attention, Mamba SSM, and MoE.

Architecture:

Layer 1-2: Mamba + MoE
Layer 3:   Attention + MoE  
Layer 4-5: Mamba + MoE
Layer 6:   Attention + MoE
...

Advantages:

  • Mamba efficiently handles long sequences
  • Attention maintains global modeling capability
  • MoE expands model capacity
  • 52B total parameters, 12B active, supports 256K context

7. Training Challenges

7.1 Load Imbalance

Problem: Some experts are selected too frequently (hot experts), others almost never.

Solutions:

| Solution | Description |
| --- | --- |
| Auxiliary loss | Add a balancing loss to encourage uniform allocation |
| Expert Choice | Experts choose tokens rather than tokens choosing experts |
| Capacity limits | Cap the number of tokens processed per expert |
| Random routing | Add randomness to routing decisions |
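The Expert Choice solution inverts the selection: instead of each token picking its top experts (which lets hot experts overflow), each expert picks its top tokens, so load is balanced by construction. A numpy sketch in the spirit of Zhou et al. (2022); note a token may be picked by several experts or by none:

```python
import numpy as np

def expert_choice_routing(scores, capacity):
    """Each expert selects its top-`capacity` tokens by affinity score.
    scores: (T, N) token-expert affinities.
    Returns a dict mapping expert index -> array of selected token indices."""
    assignments = {}
    for e in range(scores.shape[1]):
        assignments[e] = np.argsort(scores[:, e])[-capacity:]
    return assignments
```

Every expert processes exactly `capacity` tokens, so no auxiliary balancing loss is needed.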

7.2 Expert Collapse

Problem: Some experts gradually degrade and stop being selected, forming "dead experts."

Mitigation:

  • Ensure expert diversity at initialization
  • Periodically reinitialize cold experts
  • Use stronger regularization
  • DeepSeek-V3's auxiliary-loss-free load balancing

7.3 Communication Overhead

Expert Parallelism: Different experts distributed across different GPUs.

  • Each MoE layer requires AllToAll communication (dispatching tokens to corresponding expert GPUs)
  • Communication volume scales with number of active experts and tokens
  • Requires high-speed interconnects (NVLink, InfiniBand)

graph LR
    subgraph GPU0
    T1[Tokens] --> E1[Expert 0]
    T1 --> E2[Expert 1]
    end

    subgraph GPU1
    T2[Tokens] --> E3[Expert 2]
    T2 --> E4[Expert 3]
    end

    T1 -.->|AllToAll| E3
    T1 -.->|AllToAll| E4
    T2 -.->|AllToAll| E1
    T2 -.->|AllToAll| E2

8. MoE vs Dense Models

| Dimension | Dense | MoE |
| --- | --- | --- |
| Parameter efficiency | Low (all activated) | High (sparse activation) |
| Training FLOPs | High | Lower (at same performance) |
| Inference speed | Stable | Affected by routing and communication |
| Memory requirement | \(\Phi\) | \(N \times \Phi_e\) (must store all experts) |
| Fine-tuning | Simple | Complex (routing stability) |
| Training stability | Good | Requires extra techniques |
| Scalability | Linear | Sub-linear compute growth |

9. Summary

| Model | Year | Experts | TopK | Total Params | Active Params |
| --- | --- | --- | --- | --- | --- |
| Switch Transformer | 2022 | 128 | 1 | - | - |
| Mixtral 8x7B | 2024 | 8 | 2 | 47B | 13B |
| DeepSeek-V2 | 2024 | 160+2 | 6 | 236B | 21B |
| DeepSeek-V3 | 2024 | 256+1 | 8 | 671B | 37B |
| Llama 4 Scout | 2025 | 16 | 1 | - | 17B |
| Jamba | 2024 | 16 | 2 | 52B | 12B |

Key Trends:

  1. From coarse-grained to fine-grained experts
  2. Hybrid design with shared + routed experts
  3. Fusion of MoE with SSM (Jamba)
  4. Smarter load balancing strategies
  5. Native multimodal MoE

References

  • Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer," ICLR 2017
  • Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity," JMLR 2022
  • Jiang et al., "Mixtral of Experts," 2024
  • Dai et al., "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models," ACL 2024
  • Lieber et al., "Jamba: A Hybrid Transformer-Mamba Language Model," 2024
