
Large Model Scaling and Sparse Architectures: A Deep Learning Syllabus

Phase 1: Scaling Laws — Theory and First Principles

1. Core Theory of Scaling Laws

  • Kaplan Scaling Laws (OpenAI): Power-law relationships among parameters, compute, and data.
  • Chinchilla Laws (DeepMind): Compute-optimal training theory.
  • Three dimensions of scaling: Model size (\(N\)), training data volume (\(D\)), and compute budget (\(C\)).
  • Diminishing returns: Saturation points and irreducible loss.

2. Data and Token Budgets

  • Data-constrained Scaling: Strategies for when high-quality internet data is exhausted.
  • Impact of epochs: How multi-epoch training reshapes scaling laws.
  • Token quality and diversity: The shifting effect of data cleaning on scaling curves.

Phase 2: Architectural Evolution from Dense to Sparse

3. Bottlenecks of Dense Transformers

  • Coupling of compute and parameters: The linear constraint of \(\text{Compute} \propto \text{Parameters}\).
  • Inference cost challenges: The tension between memory bandwidth and peak compute.
  • Scaling pressure from KV Cache.

4. Fundamentals of Sparsity

  • Structured sparsity vs. unstructured sparsity.
  • Sparse Attention: Sliding windows, local attention, etc. (Longformer, BigBird).
  • Sparse FFN: The precursor to MoE.

Phase 3: Mixture of Experts (MoE) — An In-Depth Deconstruction

5. MoE Fundamentals and Components

  • Expert Layer: Structural design of expert layers.
  • Gating Network (Router): The mathematical essence of routing.
  • Routing strategies:
    • Top-k Routing (Top-1 vs. Top-2).
    • Noisy Top-k Gating (injecting noise to ensure diversity).
    • Softmax vs. Sigmoid Routing.

6. Core Challenges of MoE: Training Stability and Load Balancing

  • Expert Collapse: Why do models over-concentrate on a handful of experts?
  • Load Balancing Loss: Designing auxiliary loss functions.
  • Capacity Factor: Addressing token overflow and dropping.
  • Expert Dropping: Edge cases in token processing.

7. Advanced Routing and Architectural Variants

  • Switch Transformer: Extreme Top-1 routing.
  • GLaM: A sparse model optimized for few-shot learning.
  • DeepSeek MoE innovations:
    • Shared Experts: always-active experts that capture common knowledge, reducing redundancy among routed experts.
    • Fine-grained Expert Segmentation: splitting experts into smaller units to improve utilization and specialization.
  • Expert Prototyping: Clustering experts via prototypes.

Phase 4: Distributed Training and Systems Engineering

8. Parallelism Strategies

  • Data Parallelism (DP/DDP): Standard data parallelism.
  • Model Parallelism (MP):
    • Tensor Parallelism (TP): Intra-layer parallelism.
    • Pipeline Parallelism (PP): Inter-layer parallelism.
  • Expert Parallelism (EP): MoE-specific expert-level parallelism.

9. Memory and Communication Optimization

  • ZeRO (Zero Redundancy Optimizer): Detailed breakdown of Stage 1/2/3.
  • FSDP (Fully Sharded Data Parallel): Practical implementation in PyTorch.
  • Communication Overheads: The All-to-All communication bottleneck in MoE.
  • Mixed Precision: Applications of FP16, BF16, and FP8 in large-scale training.

Phase 5: Inference, Deployment, and Evaluation

10. MoE Inference Optimization

  • Expert Offloading: Running large MoE models on memory-constrained devices.
  • Sparse operator acceleration: Optimizing dynamic routing at the CUDA kernel level.
  • Speculative Decoding: Speculative sampling techniques tailored for MoE.

11. Classic MoE Model Case Studies

  • Mixtral 8x7B: The open-source benchmark for modern MoE.
  • Grok-1: Routing in practice at massive parameter scale.
  • Jamba: Combining MoE with Mamba (SSM).

Phase 6: Future Directions

12. What Comes Next in Architecture

  • Ultimate compute-parameter decoupling: Dynamic compute allocation.
  • Multimodal MoE: Specialization of experts across different modalities.
  • Re-deriving scaling laws for MoE: Active parameters vs. total parameters.

Supplementary Material: Core Concepts in Detail

A. Scaling Laws Detailed Derivations

Kaplan Scaling Laws (2020)

Kaplan et al. (2020), in "Scaling Laws for Neural Language Models," systematically revealed the power-law relationships between language model performance and three core variables.

Core formulas:

\[L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076\]
\[L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095\]
\[L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \approx 0.050\]

where \(L\) is cross-entropy loss, \(N\) is the number of model parameters (excluding embeddings), \(D\) is the volume of training data (in tokens), and \(C\) is the compute budget (in FLOPs). \(N_c, D_c, C_c\) are constants.

Key findings:

  1. Power-law relationships: Loss decreases smoothly with increasing \(N\), \(D\), and \(C\), following a power law -- approximately linear on log-log axes.
  2. Parameters matter more: Under a fixed compute budget, Kaplan argued that priority should be given to increasing model size \(N\) rather than training data \(D\). Their recommended optimal ratio was approximately \(N \propto C^{0.73}\).
  3. Irreducible loss: There exists a lower bound \(L_\infty\) below which loss cannot fall regardless of resources, because natural language itself has irreducible entropy.

Joint Scaling Law:

\[L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}\]

This shows that when \(N\) and \(D\) are mismatched (e.g., large model + little data), significant performance degradation occurs.
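
To make the shapes concrete, the snippet below evaluates the single-variable and joint power laws. The exponents are the ones quoted above, and the scale constants \(N_c, D_c\) are set close to the values reported in the paper; treat the numbers as illustrative rather than exact.

```python
# Kaplan-style power laws; constants are approximate and illustrative, not exact fits.
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13   # scale constants for non-embedding params and tokens

def loss_vs_params(n: float) -> float:
    """L(N): loss when data and compute are effectively unlimited."""
    return (N_C / n) ** ALPHA_N

def loss_joint(n: float, d: float) -> float:
    """Joint law L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

# A huge model starved of tokens gains little over a smaller, better-fed one.
print(loss_joint(280e9, 300e9))   # large N, small D
print(loss_joint(70e9, 1.4e12))   # smaller N, much more D
```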

Chinchilla Scaling Laws (2022)

Hoffmann et al. (2022), in "Training Compute-Optimal Large Language Models," offered an important correction to Kaplan's conclusions.

Core insight: Kaplan underestimated the importance of data.

The Chinchilla Law's core formula:

\[L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}\]

where \(E\) is the irreducible loss, \(A\) and \(B\) are constants, \(\alpha \approx 0.34\), and \(\beta \approx 0.28\).

Compute-optimal allocation: Given a fixed compute budget \(C \approx 6ND\) (FLOPs), the optimal strategy is to scale parameters and data proportionally:

\[N_{opt} \propto C^{0.5}, \quad D_{opt} \propto C^{0.5}\]

That is, when model parameters double, training data should also double.
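
The \(C^{0.5}\) allocation follows from minimizing \(L(N, D)\) above subject to \(C \approx 6ND\). A small sketch, using constants close to the published fit (\(A \approx 406\), \(B \approx 411\)); treat the outputs as order-of-magnitude guidance, not an exact reproduction of the paper's tables.

```python
def chinchilla_optimal(compute_flops: float, A: float = 406.4, B: float = 410.7,
                       alpha: float = 0.34, beta: float = 0.28):
    """Minimize L = E + A/N^alpha + B/D^beta subject to C = 6*N*D.

    Setting dL/dN = 0 under the constraint gives
      N_opt = G * (C/6)^(beta/(alpha+beta)),  D_opt = (C/6) / N_opt,
    with G = (alpha*A / (beta*B))^(1/(alpha+beta)).
    """
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = G * (compute_flops / 6.0) ** (beta / (alpha + beta))
    d_opt = (compute_flops / 6.0) / n_opt
    return n_opt, d_opt

# A budget on the order of Gopher/Chinchilla's training compute.
n_opt, d_opt = chinchilla_optimal(5.8e23)
print(f"N_opt ~ {n_opt / 1e9:.0f}B params, D_opt ~ {d_opt / 1e12:.2f}T tokens")
```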

Empirical validation: Chinchilla (70B parameters, 1.4T tokens) significantly outperformed Gopher (280B parameters, 300B tokens) under the same compute budget, demonstrating that Gopher was severely under-trained.

| Model | Parameters | Training Tokens | Compute | Conclusion |
|---|---|---|---|---|
| Gopher | 280B | 300B | ~\(C\) | Over-parameterized, data-starved |
| Chinchilla | 70B | 1.4T | ~\(C\) | Compute-optimal configuration |

Training-Optimal vs. Inference-Optimal Trade-off

Chinchilla's "compute-optimal" perspective is framed from the standpoint of training cost. In real-world deployment, however, inference cost often vastly exceeds training cost.

Inference-optimal perspective: Smaller model + more training data = lower inference cost.

This is the design philosophy behind the LLaMA family: LLaMA-7B was trained on 1T tokens, far exceeding the Chinchilla-optimal ratio, but it is more lightweight and efficient at inference time.

\[\text{Training-optimal} \neq \text{Inference-optimal}\]

Practical trade-offs:

  • If a model is trained once but serves billions of inference requests, one should over-train a smaller model
  • If the compute budget is limited and only a small number of inferences are needed, the Chinchilla ratio should be followed
  • Scaling Laws define the Pareto frontier of training efficiency; real-world deployment requires holistic consideration

B. MoE (Mixture of Experts) Core Mechanisms

The Mathematical Essence of Sparse Activation

The core idea of MoE: not all parameters participate in computing every token.

Given input \(x\), the MoE layer computes:

\[y = \sum_{i=1}^{E} g_i(x) \cdot \text{Expert}_i(x)\]

where \(g_i(x)\) is the gating function (router). For Top-k routing, most \(g_i = 0\).
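
A minimal, didactic PyTorch implementation of this computation with Top-k routing is sketched below; real systems batch tokens per expert, enforce capacity limits, and add a balancing loss, all omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal MoE layer: y = sum_i g_i(x) * Expert_i(x) with top-k sparse gating."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, d_model)
        logits = self.router(x)                             # (tokens, n_experts)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(topk_logits, dim=-1)              # renormalize over the k chosen experts
        y = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                          # chosen expert per token for this slot
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    y[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return y

moe = TopKMoE(d_model=64, d_hidden=256)
out = moe(torch.randn(10, 64))   # 10 tokens
```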

Switch Transformer (Fedus et al., 2021):

  • Extreme simplification: Top-1 routing, each token activates only one expert
  • Routing function: \(g(x) = \text{Softmax}(W_g \cdot x)\), taking the argmax
  • Advantages: minimal communication, simple implementation
  • Challenge: unbalanced expert loads, requiring an auxiliary loss:
\[\mathcal{L}_{balance} = \alpha \cdot E \cdot \sum_{i=1}^{E} f_i \cdot P_i\]

where \(f_i\) is the fraction of tokens dispatched to expert \(i\), and \(P_i\) is the mean router probability assigned to expert \(i\), averaged over the batch.
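
A sketch of this auxiliary loss for a top-1 router; the function name and the default \(\alpha\) are illustrative.

```python
import torch
import torch.nn.functional as F

def switch_load_balancing_loss(router_logits: torch.Tensor, expert_index: torch.Tensor,
                               alpha: float = 0.01) -> torch.Tensor:
    """L_balance = alpha * E * sum_i f_i * P_i  (Switch-style auxiliary loss).

    router_logits: (tokens, E) raw gate logits
    expert_index:  (tokens,)   the expert each token was routed to (top-1)
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                        # (tokens, E)
    f = F.one_hot(expert_index, num_experts).float().mean(dim=0)    # fraction of tokens per expert
    P = probs.mean(dim=0)                                           # mean router probability per expert
    return alpha * num_experts * torch.sum(f * P)

# Example: 16 tokens routed among 4 experts by a top-1 gate.
logits = torch.randn(16, 4)
loss = switch_load_balancing_loss(logits, logits.argmax(dim=-1))
```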

Mixtral 8x7B (Mistral AI, 2024):

  • 8 experts, Top-2 activation per token
  • Total parameters: 46.7B; active parameters: ~12.9B (a back-of-envelope check follows this list)
  • Key innovation: per-layer independent routing, no shared experts
  • Matches or exceeds LLaMA-2 70B on most benchmarks
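
As a back-of-envelope check on the total vs. active counts above, using the approximate published Mixtral dimensions (hidden size 4096, 32 layers, SwiGLU FFN width 14336, grouped-query attention with 8 KV heads of size 128, 32k vocabulary, untied output head); norms and small tensors are ignored.

```python
d_model, n_layers, d_ff, vocab = 4096, 32, 14336, 32000
n_kv_heads, head_dim = 8, 128
n_experts, top_k = 8, 2

# Attention (GQA): Wq and Wo are d_model x d_model; Wk and Wv map to the smaller KV width.
attn = 2 * d_model * d_model + 2 * d_model * n_kv_heads * head_dim
# One SwiGLU expert: gate, up, and down projections.
expert = 3 * d_model * d_ff
embeddings = 2 * vocab * d_model   # input embedding + output head (assumed untied)

total  = embeddings + n_layers * (attn + n_experts * expert)
active = embeddings + n_layers * (attn + top_k * expert)
print(f"total ~ {total / 1e9:.1f}B, active ~ {active / 1e9:.1f}B")  # roughly 46.7B / 12.9B
```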

Core Issues in Expert Routing

  1. Expert Collapse: Some experts are over-utilized while others "starve." The root cause is a positive feedback loop in the router -- experts selected more often receive more gradient updates, making them even more likely to be selected.
  2. Token Dropping: When the capacity factor is insufficient, overflow tokens are dropped, causing information loss (a minimal capacity check is sketched after this list).
  3. Communication Bottleneck: Expert Parallelism requires All-to-All communication, and bandwidth becomes the limiting factor.
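
A minimal sketch of the capacity-factor mechanism behind issue 2: each expert accepts at most `capacity_factor * tokens / num_experts` tokens per batch, and the rest are dropped (typically passing through only the residual path). The function name and the first-come-first-served selection are illustrative choices.

```python
import torch

def apply_capacity(expert_index: torch.Tensor, num_experts: int,
                   capacity_factor: float = 1.25) -> torch.Tensor:
    """Return a boolean keep-mask implementing per-expert capacity limits."""
    n_tokens = expert_index.shape[0]
    capacity = int(capacity_factor * n_tokens / num_experts)
    keep = torch.zeros(n_tokens, dtype=torch.bool)
    for e in range(num_experts):
        positions = (expert_index == e).nonzero(as_tuple=True)[0]
        keep[positions[:capacity]] = True    # tokens beyond capacity are dropped
    return keep

# 12 tokens, 4 experts, capacity factor 1.0 -> each expert keeps at most 3 tokens.
idx = torch.tensor([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3])
print(apply_capacity(idx, num_experts=4, capacity_factor=1.0))
```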

C. Long-Context Architectures

Evolution of Positional Encoding

Traditional Transformers use absolute positional encodings (sinusoidal or learned), but these suffer from poor length extrapolation.

RoPE (Rotary Position Embedding, Su et al., 2021):

Core idea: encode positional information as rotation matrices so that attention scores depend only on relative positions.

\[f_q(x_m, m) = (W_q x_m) e^{im\theta}\]
\[\langle f_q(x_m, m), f_k(x_n, n) \rangle = \text{Re}\left[(W_q x_m)(W_k x_n)^* e^{i(m-n)\theta}\right]\]

Advantages:

  • Naturally encodes relative positions
  • Can be extended to longer contexts via NTK-aware interpolation
  • Adopted by mainstream models including LLaMA, Qwen, and Mistral
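
A minimal sketch of the rotary transform described above, using the interleaved-pair convention; batching and head dimensions are omitted.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq_len, dim), dim even.

    Adjacent channel pairs are treated as 2-D vectors and rotated by an angle
    m * theta_i that grows with the position m.
    """
    seq_len, dim = x.shape
    theta = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)   # (dim/2,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * theta    # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                         # pair up channels
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = apply_rope(torch.randn(128, 64))   # queries for a 128-token sequence, head dim 64
k = apply_rope(torch.randn(128, 64))
# q[m] . k[n] now depends only on the relative offset m - n for the same underlying vectors.
```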

ALiBi (Attention with Linear Biases, Press et al., 2022):

Core idea: instead of using positional encodings, directly add a linear bias proportional to distance onto attention scores:

\[\text{Attention}(q_i, k_j) = q_i \cdot k_j - m \cdot |i - j|\]

where \(m\) is a fixed, head-specific slope set as a hyperparameter rather than learned.

Advantages:

  • Zero additional parameters
  • Naturally supports length extrapolation
  • Extremely simple to implement
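
A sketch of building the ALiBi bias, using the geometric slope schedule from the paper for power-of-two head counts; the bias is simply added to the raw attention scores before the softmax.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build the ALiBi bias tensor (n_heads, seq_len, seq_len) to add to attention scores.

    Slopes follow the geometric schedule m_h = 2^(-8h / n_heads), h = 1..n_heads,
    which is the paper's recipe when n_heads is a power of two.
    """
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs()           # |i - j|
    return -slopes[:, None, None] * distance[None, :, :]     # score_ij -= m_h * |i - j|

# scores: raw q.k products per head; add the bias, then softmax as usual.
scores = torch.randn(8, 16, 16)
biased = scores + alibi_bias(8, 16)
```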

Ring Attention

When sequence length exceeds the memory capacity of a single GPU, Ring Attention provides a distributed solution:

  • The long sequence is split into chunks distributed across devices
  • Each device computes local attention while passing KV blocks in a ring topology (the underlying blockwise accumulation is sketched after this list)
  • Computation and communication overlap, theoretically supporting unlimited sequence length
  • A key enabling technology for million-token context models such as Gemini
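
The numerical core that makes this possible is that exact softmax attention can be accumulated one KV block at a time using a running max and normalizer, so a device never needs the full sequence in memory. The single-process sketch below shows that blockwise accumulation; in Ring Attention, each loop iteration corresponds to receiving the next KV block from a neighboring device.

```python
import torch

def blockwise_attention(q, k_chunks, v_chunks):
    """Exact attention over a list of KV chunks via online softmax accumulation.

    q: (n_q, d); k_chunks / v_chunks: lists of (n_c, d) blocks, e.g. one per device.
    Memory stays proportional to one chunk instead of the full sequence.
    """
    d = q.shape[-1]
    acc = torch.zeros_like(q)                        # unnormalized weighted sum of values
    denom = torch.zeros(q.shape[0])                  # running softmax normalizer
    running_max = torch.full((q.shape[0],), float("-inf"))
    for k_blk, v_blk in zip(k_chunks, v_chunks):     # one "hop" of the KV ring per iteration
        s = q @ k_blk.T / d ** 0.5
        new_max = torch.maximum(running_max, s.max(dim=-1).values)
        scale = torch.exp(running_max - new_max)     # rescale previous accumulators
        p = torch.exp(s - new_max[:, None])
        acc = acc * scale[:, None] + p @ v_blk
        denom = denom * scale + p.sum(dim=-1)
        running_max = new_max
    return acc / denom[:, None]

q = torch.randn(4, 32)
ks = [torch.randn(16, 32) for _ in range(3)]
vs = [torch.randn(16, 32) for _ in range(3)]
out = blockwise_attention(q, ks, vs)
# Matches full attention computed in one shot:
full = torch.softmax(q @ torch.cat(ks).T / 32 ** 0.5, dim=-1) @ torch.cat(vs)
assert torch.allclose(out, full, atol=1e-5)
```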

D. Architecture Search and Efficient Architectures

Efficient Transformer Variants

| Method | Core Idea | Complexity | Representative Models |
|---|---|---|---|
| Standard Attention | Full self-attention | \(O(n^2)\) | GPT, BERT |
| Sparse Attention | Local/fixed patterns | \(O(n\sqrt{n})\) | Longformer, BigBird |
| Linear Attention | Kernel approximation | \(O(n)\) | Performer, RWKV |
| Flash Attention | IO-aware exact attention | \(O(n^2)\), but much faster in practice | FlashAttention-2/3 |
| State Space Models | Recurrent state updates | \(O(n)\) | Mamba, S4 |

Flash Attention does not change the mathematical form of attention; instead, it drastically reduces HBM accesses through tiling and kernel fusion. This is a systems-level algorithmic optimization.

Mamba (Gu & Dao, 2023):

  • Selective State Space Model (Selective SSM)
  • Input-dependent selection mechanism that resolves the inability of traditional SSMs to perform content-aware processing (a naive recurrence is sketched after this list)
  • Linear time complexity with significant advantages on long-sequence tasks
  • Jamba = Mamba + MoE, combining the strengths of both
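
A naive, loop-based sketch of the selective-scan recurrence that gives Mamba its content awareness: the step size \(\Delta\), input map \(B\), and readout \(C\) are all computed from the input. Real implementations use a fused, hardware-aware parallel scan plus a convolution/gating path, all omitted here; the dimensions and the initialization of \(A\) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Diagonal selective SSM: h_t = Abar_t * h_{t-1} + Bbar_t * x_t,  y_t = <C_t, h_t>."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # Fixed negative-real diagonal A; delta, B, C are produced from the input ("selection").
        self.log_a = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()).repeat(d_model, 1))
        self.to_delta = nn.Linear(d_model, d_model)
        self.to_b = nn.Linear(d_model, d_state)
        self.to_c = nn.Linear(d_model, d_state)

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (batch, seq, d_model)
        bsz, seq, d = x.shape
        A = -torch.exp(self.log_a)                             # (d, n), all negative
        delta = F.softplus(self.to_delta(x))                   # (b, l, d) input-dependent step size
        B, C = self.to_b(x), self.to_c(x)                      # (b, l, n) each
        h = x.new_zeros(bsz, d, A.shape[-1])
        ys = []
        for t in range(seq):
            a_bar = torch.exp(delta[:, t, :, None] * A)        # discretized transition (b, d, n)
            b_bar = delta[:, t, :, None] * B[:, t, None, :]    # discretized input map   (b, d, n)
            h = a_bar * h + b_bar * x[:, t, :, None]
            ys.append((h * C[:, t, None, :]).sum(-1))          # readout with input-dependent C
        return torch.stack(ys, dim=1)                          # (b, l, d)

ssm = SelectiveSSM(d_model=32)
y = ssm(torch.randn(2, 64, 32))
```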

Neural Architecture Search (NAS) in the Age of Large Models

Traditional NAS is constrained in the large-model era (the search space is vast and every candidate evaluation amounts to an expensive training run), but certain ideas remain valuable:

  • Scaling-aware NAS: Search at small scale, then use Scaling Laws to predict large-scale performance
  • Modular search: Fix the overall architecture and search over local modules (e.g., FFN width ratio, number of attention heads)
  • Compound Scaling: Jointly optimize depth, width, and resolution (extending the EfficientNet philosophy to LLMs)
