Large Model Scaling and Sparse Architectures: A Deep Learning Syllabus
Phase 1: Scaling Laws — Theory and First Principles
1. Core Theory of Scaling Laws
- Kaplan Scaling Laws (OpenAI): Power-law relationships among parameters, compute, and data.
- Chinchilla Laws (DeepMind): Compute-optimal training theory.
- Three dimensions of scaling: Model size (\(N\)), training data volume (\(D\)), and compute budget (\(C\)).
- Diminishing returns: Saturation points and irreducible loss.
2. Data and Token Budgets
- Data-constrained Scaling: Strategies for when high-quality internet data is exhausted.
- Impact of epochs: How multi-epoch training reshapes scaling laws.
- Token quality and diversity: How data cleaning shifts scaling curves.
Phase 2: Architectural Evolution from Dense to Sparse
3. Bottlenecks of Dense Transformers
- Coupling of compute and parameters: The linear constraint of \(\text{Compute} \propto \text{Parameters}\).
- Inference cost challenges: The tension between memory bandwidth and peak compute.
- Scaling pressure from KV Cache.
4. Fundamentals of Sparsity
- Structured sparsity vs. unstructured sparsity.
- Sparse Attention: Sliding windows, local attention, etc. (Longformer, BigBird).
- Sparse FFN: The precursor to MoE.
Phase 3: Mixture of Experts (MoE) — An In-Depth Deconstruction
5. MoE Fundamentals and Components
- Expert Layer: Structural design of expert layers.
- Gating Network (Router): The mathematical essence of routing.
- Routing strategies:
- Top-k Routing (Top-1 vs. Top-2).
- Noisy Top-k Gating (injecting noise to ensure diversity).
- Softmax vs. Sigmoid Routing.
6. Core Challenges of MoE: Training Stability and Load Balancing
- Expert Collapse: Why do models over-concentrate on a handful of experts?
- Load Balancing Loss: Designing auxiliary loss functions.
- Capacity Factor: Addressing token overflow and dropping.
- Expert Dropping: Edge cases in token processing.
7. Advanced Routing and Architectural Variants
- Switch Transformer: Extreme Top-1 routing.
- GLaM: A sparse model optimized for few-shot learning.
- DeepSeek MoE innovations:
- Shared Experts: Always-active experts that capture common knowledge, reducing redundancy among routed experts.
- Fine-grained Experts: Improving expert utilization.
- Expert Prototyping: Clustering experts via prototypes.
Phase 4: Distributed Training and Systems Engineering
8. Parallelism Strategies
- Data Parallelism (DP/DDP): Standard data parallelism.
- Model Parallelism (MP):
- Tensor Parallelism (TP): Intra-layer parallelism.
- Pipeline Parallelism (PP): Inter-layer parallelism.
- Expert Parallelism (EP): MoE-specific expert-level parallelism.
9. Memory and Communication Optimization
- ZeRO (Zero Redundancy Optimizer): Detailed breakdown of Stage 1/2/3.
- FSDP (Fully Sharded Data Parallel): Practical implementation in PyTorch.
- Communication Overheads: The All-to-All communication bottleneck in MoE.
- Mixed Precision: Applications of FP16, BF16, and FP8 in large-scale training.
Phase 5: Inference, Deployment, and Evaluation
10. MoE Inference Optimization
- Expert Offloading: Running large MoE models on memory-constrained devices.
- Sparse operator acceleration: Optimizing dynamic routing at the CUDA kernel level.
- Speculative Decoding: Speculative sampling techniques tailored for MoE.
11. Classic MoE Model Case Studies
- Mixtral 8x7B: The open-source benchmark for modern MoE.
- Grok-1: Routing in practice at massive parameter scale.
- Jamba: Combining MoE with Mamba (SSM).
Phase 6: Future Directions
12. What Comes Next in Architecture
- Ultimate compute-parameter decoupling: Dynamic compute allocation.
- Multimodal MoE: Specialization of experts across different modalities.
- Re-deriving scaling laws for MoE: Active parameters vs. total parameters.
Supplementary Material: Core Concepts in Detail
A. Scaling Laws Detailed Derivations
Kaplan Scaling Laws (2020)
Kaplan et al. (2020), in "Scaling Laws for Neural Language Models," systematically revealed the power-law relationships between language model performance and three core variables.
Core formulas (each holds when performance is not bottlenecked by the other two factors):
\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
\]
where \(L\) is cross-entropy loss, \(N\) is the number of model parameters (excluding embeddings), \(D\) is the volume of training data (in tokens), and \(C\) is the compute budget (in FLOPs). \(N_c, D_c, C_c\) are fitted constants, and the exponents are approximately \(\alpha_N \approx 0.076\), \(\alpha_D \approx 0.095\), \(\alpha_C \approx 0.050\).
Key findings:
- Power-law relationships: Loss decreases smoothly with increasing \(N\), \(D\), and \(C\), following a power law -- approximately linear on log-log axes.
- Parameters matter more: Under a fixed compute budget, Kaplan argued that priority should be given to increasing model size \(N\) rather than training data \(D\); their fitted compute-optimal scaling was approximately \(N_{\text{opt}} \propto C^{0.73}\).
- Irreducible loss: There exists a lower bound \(L_\infty\) below which loss cannot fall regardless of resources, because natural language itself has irreducible entropy.
Joint Scaling Law:
\[
L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\frac{\alpha_N}{\alpha_D}} + \frac{D_c}{D}\right]^{\alpha_D}
\]
This shows that when \(N\) and \(D\) are mismatched (e.g., large model + little data), significant performance degradation occurs relative to a balanced configuration.
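As a quick illustration, the joint law can be evaluated numerically. The sketch below uses the approximate constants reported by Kaplan et al. (2020); treat them as illustrative fits, not authoritative values.

```python
# Illustrative evaluation of Kaplan's joint scaling law:
#   L(N, D) = [ (N_c / N)^(alpha_N / alpha_D) + D_c / D ]^alpha_D
# Constants are the approximate fits reported by Kaplan et al. (2020).

ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13  # parameters / tokens

def joint_loss(n_params: float, n_tokens: float) -> float:
    """Predicted cross-entropy loss for n_params non-embedding
    parameters trained on n_tokens tokens."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# Under Kaplan's fit, the data-starved 280B configuration actually looks
# slightly *better* than 70B/1.4T -- precisely the bias toward large models
# that Chinchilla (below) later corrected.
print(joint_loss(70e9, 1.4e12))   # ~1.74 (Chinchilla-like configuration)
print(joint_loss(280e9, 300e9))   # ~1.71 (Gopher-like configuration)
```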
Chinchilla Scaling Laws (2022)
Hoffmann et al. (2022), in "Training Compute-Optimal Large Language Models," offered an important correction to Kaplan's conclusions.
Core insight: Kaplan underestimated the importance of data.
The Chinchilla Law's core formula:
\[
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]
where \(E\) is the irreducible loss, \(A\) and \(B\) are fitted constants, \(\alpha \approx 0.34\), and \(\beta \approx 0.28\).
Compute-optimal allocation: Given a fixed compute budget \(C \approx 6ND\) (FLOPs), the optimal strategy is to scale parameters and data proportionally:
\[
N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}
\]
That is, when model parameters double, training data should also double (empirically, roughly 20 training tokens per parameter).
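A minimal sketch of this allocation rule, assuming the roughly 20-tokens-per-parameter ratio implied by the Chinchilla fits (the helper name and constants are illustrative):

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a budget C ~= 6*N*D into a compute-optimal (N, D), assuming
    D/N ~= tokens_per_param. Solving C = 6*N*(r*N) gives N = sqrt(C/(6r))."""
    n_opt = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# Chinchilla itself: C = 6 * 70e9 * 1.4e12 ~= 5.9e23 FLOPs
n, d = chinchilla_optimal(5.88e23)
print(f"N ~= {n:.3g} params, D ~= {d:.3g} tokens")  # ~70B params, ~1.4T tokens
```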
Empirical validation: Chinchilla (70B parameters, 1.4T tokens) significantly outperformed Gopher (280B parameters, 300B tokens) under the same compute budget, demonstrating that Gopher was severely under-trained.
| Model | Parameters | Training Tokens | Compute | Conclusion |
|---|---|---|---|---|
| Gopher | 280B | 300B | ~\(C\) | Over-parameterized, data-starved |
| Chinchilla | 70B | 1.4T | ~\(C\) | Compute-optimal configuration |
Training-Optimal vs. Inference-Optimal Trade-off
Chinchilla's "compute-optimal" perspective is framed from the standpoint of training cost. In real-world deployment, however, inference cost often vastly exceeds training cost.
Inference-optimal perspective: Smaller model + more training data = lower inference cost.
This is the design philosophy behind the LLaMA family: LLaMA-7B was trained on 1T tokens, far exceeding the Chinchilla-optimal ratio, but it is more lightweight and efficient at inference time.
Practical trade-offs:
- If a model is trained once but serves billions of inference requests, one should over-train a smaller model (see the cost sketch after this list)
- If the compute budget is limited and only a small number of inferences are needed, the Chinchilla ratio should be followed
- Scaling Laws define the Pareto frontier of training efficiency; real-world deployment requires holistic consideration
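A back-of-the-envelope comparison makes the trade-off concrete. The sketch uses the common approximations of ~\(6ND\) training FLOPs and ~\(2N\) FLOPs per generated token; the request volume and model pair below are hypothetical.

```python
def lifetime_flops(n_params, train_tokens, n_requests, tokens_per_request):
    """Total FLOPs ~= training (6*N*D) + inference (2*N per generated token).
    Illustrative accounting only; real costs depend on hardware and batching."""
    train = 6 * n_params * train_tokens
    infer = 2 * n_params * n_requests * tokens_per_request
    return train + infer

# Two models with identical *training* compute, serving 1e9 requests
# of 500 generated tokens each:
compute_optimal = lifetime_flops(70e9, 1.4e12, 1e9, 500)   # Chinchilla ratio
over_trained    = lifetime_flops(7e9, 14e12, 1e9, 500)     # LLaMA-style
print(f"{compute_optimal:.3g} vs {over_trained:.3g} total FLOPs")
# Inference costs 7e22 vs 7e21 FLOPs: the over-trained small model is
# 10x cheaper to serve once inference dominates.
```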
B. MoE (Mixture of Experts) Core Mechanisms
The Mathematical Essence of Sparse Activation
The core idea of MoE: not all parameters participate in computing every token.
Given input \(x\), an MoE layer with \(n\) experts computes:
\[
y = \sum_{i=1}^{n} g_i(x) \, E_i(x)
\]
where \(E_i\) is the \(i\)-th expert network and \(g_i(x)\) is the gating function (router). For Top-k routing, all but \(k\) of the \(g_i\) are zero.
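A minimal PyTorch sketch of this computation with token-level Top-k routing (layer shapes, the renormalized softmax over the selected logits, and all names are illustrative; capacity limits and auxiliary losses are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of an MoE layer: y = sum_i g_i(x) * E_i(x), with g sparse (Top-k)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        logits = self.router(x)                           # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)        # keep k experts/token
        weights = F.softmax(weights, dim=-1)              # renormalize over top-k
        y = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens routed to expert e
                if mask.any():
                    y[mask] += weights[mask, slot, None] * expert(x[mask])
        return y

out = TopKMoE(d_model=64, d_ff=256)(torch.randn(10, 64))  # 10 tokens
```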
Switch Transformer (Fedus et al., 2021):
- Extreme simplification: Top-1 routing, each token activates only one expert
- Routing function: \(g(x) = \text{Softmax}(W_g \cdot x)\), taking the argmax
- Advantages: minimal communication, simple implementation
- Challenge: unbalanced expert loads, requiring an auxiliary loss:
\[
\mathcal{L}_{\text{aux}} = \alpha \cdot N_E \cdot \sum_{i=1}^{N_E} f_i \cdot P_i
\]
where \(N_E\) is the number of experts, \(f_i\) is the fraction of tokens assigned to expert \(i\), \(P_i\) is the mean routing probability of expert \(i\), and \(\alpha\) is a small weighting coefficient.
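The loss is easy to compute directly from router logits. A sketch (the default \(\alpha = 0.01\) follows the Switch Transformer paper; function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def switch_load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01):
    """Switch Transformer auxiliary loss: alpha * N_E * sum_i f_i * P_i.
    router_logits: (tokens, n_experts). Minimized (value alpha) when both the
    hard assignment fractions f and the mean probabilities P are uniform."""
    n_tokens, n_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)
    assigned = probs.argmax(dim=-1)                        # hard Top-1 routing
    f = torch.bincount(assigned, minlength=n_experts).float() / n_tokens
    p = probs.mean(dim=0)                                  # mean routing prob
    return alpha * n_experts * torch.sum(f * p)

loss = switch_load_balancing_loss(torch.randn(1024, 8))
```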
Mixtral 8x7B (Mistral AI, 2024):
- 8 experts, Top-2 activation per token
- Total parameters: 46.7B; active parameters: ~12.9B
- Key innovation: per-layer independent routing, no shared experts
- Matches or exceeds LLaMA-2 70B on most benchmarks
Core Issues in Expert Routing
- Expert Collapse: Some experts are over-utilized while others "starve." The root cause is a positive feedback loop in the router -- experts selected more often receive more gradient updates, making them even more likely to be selected.
- Token Dropping: When the capacity factor is insufficient, overflow tokens are dropped, causing information loss (see the capacity sketch after this list).
- Communication Bottleneck: Expert Parallelism requires All-to-All communication, and bandwidth becomes the limiting factor.
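To make the capacity-factor mechanics concrete, here is a minimal sketch of overflow dropping under Top-1 routing (the first-come-first-served tie-breaking and all names are illustrative; real systems batch this per expert and let dropped tokens pass through via the residual connection):

```python
import torch

def keep_within_capacity(assignments: torch.Tensor, n_experts: int,
                         capacity_factor: float = 1.25) -> torch.Tensor:
    """Mark which Top-1-routed tokens fit within expert capacity, where
    capacity = capacity_factor * n_tokens / n_experts. Overflow is dropped."""
    n_tokens = assignments.shape[0]
    capacity = int(capacity_factor * n_tokens / n_experts)
    kept = torch.zeros(n_tokens, dtype=torch.bool)
    count = torch.zeros(n_experts, dtype=torch.long)
    for t in range(n_tokens):            # earlier tokens win; real systems vary
        e = assignments[t]
        if count[e] < capacity:
            kept[t] = True
            count[e] += 1
    return kept                          # kept[t] == False -> token t dropped

kept = keep_within_capacity(torch.randint(0, 8, (1024,)), n_experts=8)
print(f"dropped {(~kept).sum().item()} of 1024 tokens")
```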
C. Long-Context Architectures
Evolution of Positional Encoding
Traditional Transformers use absolute positional encodings (sinusoidal or learned), but these suffer from poor length extrapolation.
RoPE (Rotary Position Embedding, Su et al., 2021):
Core idea: encode positional information as rotation matrices so that attention scores depend only on relative positions.
Advantages:
- Naturally encodes relative positions
- Can be extended to longer contexts via NTK-aware interpolation
- Adopted by mainstream models including LLaMA, Qwen, and Mistral
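A minimal sketch of the rotation itself (single sequence, single head; the even/odd dimension pairing follows one common implementation convention):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding for x of shape (seq_len, d), d even.
    Each dim pair (2i, 2i+1) is rotated by pos * base^(-2i/d), so the dot
    product of a rotated query and key depends only on their offset m - n."""
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (seq, 1)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * inv_freq                                          # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rope(torch.randn(128, 64))  # rotate queries; keys are rotated the same way
```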
ALiBi (Attention with Linear Biases, Press et al., 2022):
Core idea: instead of using positional encodings, directly add a linear bias proportional to the query-key distance onto attention scores:
\[
\text{softmax}\left(q_i K^{\top} + m \cdot \left[-(i-1), \ldots, -2, -1, 0\right]\right)
\]
where \(m\) is a fixed, head-specific slope hyperparameter.
Advantages:
- Zero additional parameters
- Naturally supports length extrapolation
- Extremely simple to implement
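A sketch of the bias construction (the geometric slope schedule \(2^{-8h/n}\) follows the paper for head counts that are powers of two; everything else is illustrative):

```python
import torch

def alibi_bias(seq_len: int, n_heads: int) -> torch.Tensor:
    """ALiBi additive bias of shape (n_heads, seq_len, seq_len): the score of
    query i attending to key j (j <= i) gets -slope_h * (i - j) added."""
    slopes = 2.0 ** (-8.0 * torch.arange(1, n_heads + 1) / n_heads)
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).clamp(min=0)  # (i - j), causal side
    return -slopes[:, None, None] * dist

bias = alibi_bias(seq_len=16, n_heads=8)
# scores = q @ k.transpose(-2, -1) / sqrt(d) + bias, then causal mask + softmax
```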
Ring Attention
When sequence length exceeds the memory capacity of a single GPU, Ring Attention provides a distributed solution:
- The long sequence is split into chunks distributed across devices
- Each device computes local attention while passing KV blocks in a ring topology
- Computation and communication overlap, theoretically supporting unlimited sequence length (a single-process sketch of the accumulation step follows this list)
- A key enabling technology for million-token context models such as Gemini
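The multi-GPU mechanics are hard to show briefly, but the numerically stable accumulation each device performs as KV chunks arrive around the ring can be simulated in one process. The sketch below (non-causal, single head) folds in one KV chunk at a time with an online softmax, so the full score matrix never materializes:

```python
import torch

def ring_attention_sim(q, k, v, n_chunks: int = 4):
    """Simulate Ring Attention's per-device loop: each KV chunk plays the role
    of a block received from a ring neighbor. q, k, v: (seq, d)."""
    d = q.shape[-1]
    acc = torch.zeros_like(q)                       # running weighted sum of V
    m = torch.full((q.shape[0], 1), -float("inf"))  # running max (stability)
    l = torch.zeros(q.shape[0], 1)                  # running softmax denominator
    for k_blk, v_blk in zip(k.chunk(n_chunks), v.chunk(n_chunks)):
        s = (q @ k_blk.T) / d**0.5                  # scores vs this chunk
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        scale = (m - m_new).exp()                   # rescale earlier partials
        p = (s - m_new).exp()
        acc = acc * scale + p @ v_blk
        l = l * scale + p.sum(dim=-1, keepdim=True)
        m = m_new
    return acc / l

q, k, v = (torch.randn(64, 32) for _ in range(3))
out = ring_attention_sim(q, k, v)
ref = torch.softmax(q @ k.T / 32**0.5, dim=-1) @ v  # full attention
assert torch.allclose(out, ref, atol=1e-4)
```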
D. Architecture Search and Efficient Architectures
Efficient Transformer Variants
| Method | Core Idea | Complexity | Representative Models |
|---|---|---|---|
| Standard Attention | Full self-attention | \(O(n^2)\) | GPT, BERT |
| Sparse Attention | Local/fixed patterns | \(O(n)\) to \(O(n\sqrt{n})\) | Longformer, BigBird |
| Linear Attention | Kernel approximation | \(O(n)\) | Performer, RWKV |
| Flash Attention | IO-aware exact attention | \(O(n^2)\) FLOPs, far fewer HBM accesses | FlashAttention-2/3 |
| State Space Models | Recurrent state updates | \(O(n)\) | Mamba, S4 |
Flash Attention does not change the mathematical form of attention; instead, it drastically reduces HBM accesses through tiling and kernel fusion. This is a systems-level algorithmic optimization.
Mamba (Gu & Dao, 2023):
- Selective State Space Model (Selective SSM)
- Input-dependent selection mechanism that resolves the inability of traditional SSMs to perform content-aware processing
- Linear time complexity with significant advantages on long-sequence tasks
- Jamba = Mamba + MoE, combining the strengths of both
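A heavily simplified sketch of the selective recurrence (it omits Mamba's input-dependent step size \(\Delta\), the discretization, and the hardware-aware parallel scan; shapes and projections are illustrative):

```python
import torch

def selective_ssm_scan(x, A, B_proj, C_proj):
    """Selective SSM recurrence, simplified: h_t = exp(A)*h_{t-1} + x_t B_t,
    y_t = h_t C_t, where B_t and C_t are computed *from the input* x_t --
    the 'selection' that lets the state decide what to remember.
    x: (seq, d); A: (d_state,) with negative entries (decaying state)."""
    h = torch.zeros(x.shape[1], A.shape[0])          # state: (d, d_state)
    ys = []
    for t in range(x.shape[0]):
        B_t, C_t = B_proj(x[t]), C_proj(x[t])        # input-dependent matrices
        h = h * torch.exp(A) + x[t, :, None] * B_t[None, :]
        ys.append(h @ C_t)                           # read out: (d,)
    return torch.stack(ys)                           # (seq, d)

d, d_state = 16, 8
y = selective_ssm_scan(torch.randn(32, d), -torch.rand(d_state),
                       torch.nn.Linear(d, d_state), torch.nn.Linear(d, d_state))
```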
Neural Architecture Search (NAS) in the Age of Large Models
Traditional NAS is constrained in the large-model era (search spaces are too vast), but certain ideas remain valuable:
- Scaling-aware NAS: Search at small scale, then use Scaling Laws to predict large-scale performance (see the curve-fitting sketch after this list)
- Modular search: Fix the overall architecture and search over local modules (e.g., FFN width ratio, number of attention heads)
- Compound Scaling: Jointly optimize depth, width, and resolution (extending the EfficientNet philosophy to LLMs)
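As a sketch of the first idea, one can fit a saturating power law to a handful of small-scale runs and extrapolate to the target scale before committing compute (all numbers below are made up for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical small-scale runs: (non-embedding params, final loss).
n = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
loss = np.array([4.10, 3.78, 3.45, 3.20, 2.98])   # made-up measurements

def power_law(n, a, alpha, e_inf):
    """L(N) = a * N^(-alpha) + E_inf, with E_inf the irreducible loss."""
    return a * n ** (-alpha) + e_inf

(a, alpha, e_inf), _ = curve_fit(power_law, n, loss, p0=(10.0, 0.1, 1.5))
print(f"predicted loss at 70B params: {power_law(70e9, a, alpha, e_inf):.3f}")
```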