Large Model Scaling and Sparse Architectures: A Deep Learning Syllabus
Phase 1: Scaling Laws — Theory and First Principles
1. Core Theory of Scaling Laws
- Kaplan Scaling Laws (OpenAI): Power-law relationships among parameters, compute, and data.
- Chinchilla Laws (DeepMind): Compute-optimal training theory.
- Three dimensions of scaling: Model size (\(N\)), training data volume (\(D\)), and compute budget (\(C\)).
- Diminishing returns: Saturation points and irreducible loss.
2. Data and Token Budgets
- Data-constrained Scaling: Strategies for when high-quality internet data is exhausted.
- Impact of epochs: How multi-epoch training reshapes scaling laws.
- Token quality and diversity: How data cleaning shifts scaling curves.
Phase 2: Architectural Evolution from Dense to Sparse
3. Bottlenecks of Dense Transformers
- Coupling of compute and parameters: The linear constraint of \(\text{Compute} \propto \text{Parameters}\).
- Inference cost challenges: The tension between memory bandwidth and peak compute.
- Scaling pressure from KV Cache.
4. Fundamentals of Sparsity
- Structured sparsity vs. unstructured sparsity.
- Sparse Attention: Sliding windows, local attention, etc. (Longformer, BigBird).
- Sparse FFN: The precursor to MoE.
Phase 3: Mixture of Experts (MoE) — An In-Depth Deconstruction
5. MoE Fundamentals and Components
- Expert Layer: Structural design of expert layers.
- Gating Network (Router): The mathematical essence of routing.
- Routing strategies:
- Top-k Routing (Top-1 vs. Top-2).
- Noisy Top-k Gating (injecting noise to ensure diversity).
- Softmax vs. Sigmoid Routing.
6. Core Challenges of MoE: Training Stability and Load Balancing
- Expert Collapse: Why do models over-concentrate on a handful of experts?
- Load Balancing Loss: Designing auxiliary loss functions.
- Capacity Factor: Addressing token overflow and dropping.
- Expert Dropping: Edge cases in token processing.
7. Advanced Routing and Architectural Variants
- Switch Transformer: Extreme Top-1 routing.
- GLaM: A sparse model optimized for few-shot learning.
- DeepSeek MoE innovations:
- Shared Experts: Always-active experts that capture common knowledge, reducing redundancy among routed experts.
- Fine-grained Experts: Improving expert utilization.
- Expert Prototyping: Clustering experts via prototypes.
Phase 4: Distributed Training and Systems Engineering
8. Parallelism Strategies
- Data Parallelism (DP/DDP): Standard data parallelism.
- Model Parallelism (MP):
- Tensor Parallelism (TP): Intra-layer parallelism.
- Pipeline Parallelism (PP): Inter-layer parallelism.
- Expert Parallelism (EP): MoE-specific expert-level parallelism.
9. Memory and Communication Optimization
- ZeRO (Zero Redundancy Optimizer): Detailed breakdown of Stage 1/2/3.
- FSDP (Fully Sharded Data Parallel): Practical implementation in PyTorch.
- Communication Overheads: The All-to-All communication bottleneck in MoE.
- Mixed Precision: Applications of FP16, BF16, and FP8 in large-scale training.
Phase 5: Inference, Deployment, and Evaluation
10. MoE Inference Optimization
- Expert Offloading: Running large MoE models on memory-constrained devices.
- Sparse operator acceleration: Optimizing dynamic routing at the CUDA kernel level.
- Speculative Decoding: Speculative sampling techniques tailored for MoE.
11. Classic MoE Model Case Studies
- Mixtral 8x7B: The open-source benchmark for modern MoE.
- Grok-1: Routing in practice at massive parameter scale.
- Jamba: Combining MoE with Mamba (SSM).
Phase 6: Future Directions
12. What Comes Next in Architecture
- Ultimate compute-parameter decoupling: Dynamic compute allocation.
- Multimodal MoE: Specialization of experts across different modalities.
- Re-deriving scaling laws for MoE: Active parameters vs. total parameters.
Supplementary Material: Core Concepts in Detail
A. Scaling Laws Detailed Derivations
Kaplan Scaling Laws (2020)
Kaplan et al. (2020), in "Scaling Laws for Neural Language Models," systematically revealed the power-law relationships between language model performance and three core variables.
Core formulas (each holds when performance is not bottlenecked by the other two factors):
\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
\]
where \(L\) is cross-entropy loss, \(N\) is the number of model parameters (excluding embeddings), \(D\) is the volume of training data (in tokens), and \(C\) is the compute budget (in FLOPs). \(N_c, D_c, C_c\) are fitted constants, and the exponents are approximately \(\alpha_N \approx 0.076\), \(\alpha_D \approx 0.095\), \(\alpha_C \approx 0.050\).
Key findings:
- Power-law relationships: Loss decreases smoothly with increasing \(N\), \(D\), and \(C\), following a power law -- approximately linear on log-log axes.
- Parameters matter more: Under a fixed compute budget, Kaplan argued that priority should be given to increasing model size \(N\) rather than training data \(D\); their fitted compute-optimal scaling was approximately \(N_{\text{opt}} \propto C^{0.73}\).
- Irreducible loss: There exists a lower bound \(L_\infty\) below which loss cannot fall regardless of resources, because natural language itself has irreducible entropy.
Joint Scaling Law:
\[
L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\frac{\alpha_N}{\alpha_D}} + \frac{D_c}{D}\right]^{\alpha_D}
\]
This shows that when \(N\) and \(D\) are mismatched (e.g., large model + little data), significant performance degradation occurs relative to a balanced configuration.
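As a quick illustration, the joint law can be evaluated numerically. The sketch below uses the approximate constants reported by Kaplan et al. (2020); treat them as illustrative fits, not authoritative values.

```python
# Illustrative evaluation of Kaplan's joint scaling law:
#   L(N, D) = [ (N_c / N)^(alpha_N / alpha_D) + D_c / D ]^alpha_D
# Constants are the approximate fits reported by Kaplan et al. (2020).

ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13  # parameters / tokens

def joint_loss(n_params: float, n_tokens: float) -> float:
    """Predicted cross-entropy loss for n_params non-embedding
    parameters trained on n_tokens tokens."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# Under Kaplan's fit, the data-starved 280B configuration actually looks
# slightly *better* than 70B/1.4T -- precisely the bias toward large models
# that Chinchilla (below) later corrected.
print(joint_loss(70e9, 1.4e12))   # ~1.74 (Chinchilla-like configuration)
print(joint_loss(280e9, 300e9))   # ~1.71 (Gopher-like configuration)
```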
Chinchilla Scaling Laws (2022)
Hoffmann et al. (2022), in "Training Compute-Optimal Large Language Models," offered an important correction to Kaplan's conclusions.
Core insight: Kaplan underestimated the importance of data.
The Chinchilla Law's core formula:
\[
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]
where \(E\) is the irreducible loss, \(A\) and \(B\) are fitted constants, \(\alpha \approx 0.34\), and \(\beta \approx 0.28\).
Compute-optimal allocation: Given a fixed compute budget \(C \approx 6ND\) (FLOPs), the optimal strategy is to scale parameters and data proportionally:
\[
N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}
\]
That is, when model parameters double, training data should also double (empirically, roughly 20 training tokens per parameter).
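A minimal sketch of this allocation rule, assuming the roughly 20-tokens-per-parameter ratio implied by the Chinchilla fits (the helper name and constants are illustrative):

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a budget C ~= 6*N*D into a compute-optimal (N, D), assuming
    D/N ~= tokens_per_param. Solving C = 6*N*(r*N) gives N = sqrt(C/(6r))."""
    n_opt = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# Chinchilla itself: C = 6 * 70e9 * 1.4e12 ~= 5.9e23 FLOPs
n, d = chinchilla_optimal(5.88e23)
print(f"N ~= {n:.3g} params, D ~= {d:.3g} tokens")  # ~70B params, ~1.4T tokens
```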
Empirical validation: Chinchilla (70B parameters, 1.4T tokens) significantly outperformed Gopher (280B parameters, 300B tokens) under the same compute budget, demonstrating that Gopher was severely under-trained.
| Model | Parameters | Training Tokens | Compute | Conclusion |
|---|---|---|---|---|
| Gopher | 280B | 300B | ~\(C\) | Over-parameterized, data-starved |
| Chinchilla | 70B | 1.4T | ~\(C\) | Compute-optimal configuration |
Training-Optimal vs. Inference-Optimal Trade-off
Chinchilla's "compute-optimal" perspective is framed from the standpoint of training cost. In real-world deployment, however, inference cost often vastly exceeds training cost.
Inference-optimal perspective: Smaller model + more training data = lower inference cost.
This is the design philosophy behind the LLaMA family: LLaMA-7B was trained on 1T tokens, far exceeding the Chinchilla-optimal ratio, but it is more lightweight and efficient at inference time.
Practical trade-offs:
- If a model is trained once but serves billions of inference requests, one should over-train a smaller model (see the cost sketch after this list)
- If the compute budget is limited and only a small number of inferences are needed, the Chinchilla ratio should be followed
- Scaling Laws define the Pareto frontier of training efficiency; real-world deployment requires holistic consideration
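A back-of-the-envelope comparison makes the trade-off concrete. The sketch uses the common approximations of ~\(6ND\) training FLOPs and ~\(2N\) FLOPs per generated token; the request volume and model pair below are hypothetical.

```python
def lifetime_flops(n_params, train_tokens, n_requests, tokens_per_request):
    """Total FLOPs ~= training (6*N*D) + inference (2*N per generated token).
    Illustrative accounting only; real costs depend on hardware and batching."""
    train = 6 * n_params * train_tokens
    infer = 2 * n_params * n_requests * tokens_per_request
    return train + infer

# Two models with identical *training* compute, serving 1e9 requests
# of 500 generated tokens each:
compute_optimal = lifetime_flops(70e9, 1.4e12, 1e9, 500)   # Chinchilla ratio
over_trained    = lifetime_flops(7e9, 14e12, 1e9, 500)     # LLaMA-style
print(f"{compute_optimal:.3g} vs {over_trained:.3g} total FLOPs")
# Inference costs 7e22 vs 7e21 FLOPs: the over-trained small model is
# 10x cheaper to serve once inference dominates.
```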
B. MoE (Mixture of Experts) Core Mechanisms
The Mathematical Essence of Sparse Activation
The core idea of MoE: not all parameters participate in computing every token.
Given input \(x\), an MoE layer with \(n\) experts computes:
\[
y = \sum_{i=1}^{n} g_i(x) \, E_i(x)
\]
where \(E_i\) is the \(i\)-th expert network and \(g_i(x)\) is the gating function (router). For Top-k routing, all but \(k\) of the \(g_i\) are zero.
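A minimal PyTorch sketch of this computation with token-level Top-k routing (layer shapes, the renormalized softmax over the selected logits, and all names are illustrative; capacity limits and auxiliary losses are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of an MoE layer: y = sum_i g_i(x) * E_i(x), with g sparse (Top-k)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        logits = self.router(x)                           # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)        # keep k experts/token
        weights = F.softmax(weights, dim=-1)              # renormalize over top-k
        y = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens routed to expert e
                if mask.any():
                    y[mask] += weights[mask, slot, None] * expert(x[mask])
        return y

out = TopKMoE(d_model=64, d_ff=256)(torch.randn(10, 64))  # 10 tokens
```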
Switch Transformer (Fedus et al., 2021):
- Extreme simplification: Top-1 routing, each token activates only one expert
- Routing function: \(g(x) = \text{Softmax}(W_g \cdot x)\), taking the argmax
- Advantages: minimal communication, simple implementation
- Challenge: unbalanced expert loads, requiring an auxiliary loss:
\[
\mathcal{L}_{\text{aux}} = \alpha \cdot N_E \cdot \sum_{i=1}^{N_E} f_i \cdot P_i
\]
where \(N_E\) is the number of experts, \(f_i\) is the fraction of tokens assigned to expert \(i\), \(P_i\) is the mean routing probability of expert \(i\), and \(\alpha\) is a small weighting coefficient.
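The loss is easy to compute directly from router logits. A sketch (the default \(\alpha = 0.01\) follows the Switch Transformer paper; function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def switch_load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01):
    """Switch Transformer auxiliary loss: alpha * N_E * sum_i f_i * P_i.
    router_logits: (tokens, n_experts). Minimized (value alpha) when both the
    hard assignment fractions f and the mean probabilities P are uniform."""
    n_tokens, n_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)
    assigned = probs.argmax(dim=-1)                        # hard Top-1 routing
    f = torch.bincount(assigned, minlength=n_experts).float() / n_tokens
    p = probs.mean(dim=0)                                  # mean routing prob
    return alpha * n_experts * torch.sum(f * p)

loss = switch_load_balancing_loss(torch.randn(1024, 8))
```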
Mixtral 8x7B (Mistral AI, 2024):
- 8 experts, Top-2 activation per token
- Total parameters: 46.7B; active parameters: ~12.9B
- Key innovation: per-layer independent routing, no shared experts
- Matches or exceeds LLaMA-2 70B on most benchmarks
Core Issues in Expert Routing
- Expert Collapse: Some experts are over-utilized while others "starve." The root cause is a positive feedback loop in the router -- experts selected more often receive more gradient updates, making them even more likely to be selected.
- Token Dropping: When the capacity factor is insufficient, overflow tokens are dropped, causing information loss (see the capacity sketch after this list).
- Communication Bottleneck: Expert Parallelism requires All-to-All communication, and bandwidth becomes the limiting factor.
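To make the capacity-factor mechanics concrete, here is a minimal sketch of overflow dropping under Top-1 routing (the first-come-first-served tie-breaking and all names are illustrative; real systems batch this per expert and let dropped tokens pass through via the residual connection):

```python
import torch

def keep_within_capacity(assignments: torch.Tensor, n_experts: int,
                         capacity_factor: float = 1.25) -> torch.Tensor:
    """Mark which Top-1-routed tokens fit within expert capacity, where
    capacity = capacity_factor * n_tokens / n_experts. Overflow is dropped."""
    n_tokens = assignments.shape[0]
    capacity = int(capacity_factor * n_tokens / n_experts)
    kept = torch.zeros(n_tokens, dtype=torch.bool)
    count = torch.zeros(n_experts, dtype=torch.long)
    for t in range(n_tokens):            # earlier tokens win; real systems vary
        e = assignments[t]
        if count[e] < capacity:
            kept[t] = True
            count[e] += 1
    return kept                          # kept[t] == False -> token t dropped

kept = keep_within_capacity(torch.randint(0, 8, (1024,)), n_experts=8)
print(f"dropped {(~kept).sum().item()} of 1024 tokens")
```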
C. Long-Context Architectures
Evolution of Positional Encoding
Traditional Transformers use absolute positional encodings (sinusoidal or learned), but these suffer from poor length extrapolation.
RoPE (Rotary Position Embedding, Su et al., 2021):
Core idea: encode positional information as rotation matrices so that attention scores depend only on relative positions.
Advantages:
- Naturally encodes relative positions
- Can be extended to longer contexts via NTK-aware interpolation
- Adopted by mainstream models including LLaMA, Qwen, and Mistral
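A minimal sketch of the rotation itself (single sequence, single head; the even/odd dimension pairing follows one common implementation convention):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding for x of shape (seq_len, d), d even.
    Each dim pair (2i, 2i+1) is rotated by pos * base^(-2i/d), so the dot
    product of a rotated query and key depends only on their offset m - n."""
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (seq, 1)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * inv_freq                                          # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rope(torch.randn(128, 64))  # rotate queries; keys are rotated the same way
```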
ALiBi (Attention with Linear Biases, Press et al., 2022):
Core idea: instead of using positional encodings, directly add a linear bias proportional to the query-key distance onto attention scores:
\[
\text{softmax}\left(q_i K^{\top} + m \cdot \left[-(i-1), \ldots, -2, -1, 0\right]\right)
\]
where \(m\) is a fixed, head-specific slope hyperparameter.
Advantages:
- Zero additional parameters
- Naturally supports length extrapolation
- Extremely simple to implement
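A sketch of the bias construction (the geometric slope schedule \(2^{-8h/n}\) follows the paper for head counts that are powers of two; everything else is illustrative):

```python
import torch

def alibi_bias(seq_len: int, n_heads: int) -> torch.Tensor:
    """ALiBi additive bias of shape (n_heads, seq_len, seq_len): the score of
    query i attending to key j (j <= i) gets -slope_h * (i - j) added."""
    slopes = 2.0 ** (-8.0 * torch.arange(1, n_heads + 1) / n_heads)
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).clamp(min=0)  # (i - j), causal side
    return -slopes[:, None, None] * dist

bias = alibi_bias(seq_len=16, n_heads=8)
# scores = q @ k.transpose(-2, -1) / sqrt(d) + bias, then causal mask + softmax
```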
Ring Attention
When sequence length exceeds the memory capacity of a single GPU, Ring Attention provides a distributed solution:
- The long sequence is split into chunks distributed across devices
- Each device computes local attention while passing KV blocks in a ring topology
- Computation and communication overlap, theoretically supporting unlimited sequence length (a single-process sketch of the accumulation step follows this list)
- A key enabling technology for million-token context models such as Gemini
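The multi-GPU mechanics are hard to show briefly, but the numerically stable accumulation each device performs as KV chunks arrive around the ring can be simulated in one process. The sketch below (non-causal, single head) folds in one KV chunk at a time with an online softmax, so the full score matrix never materializes:

```python
import torch

def ring_attention_sim(q, k, v, n_chunks: int = 4):
    """Simulate Ring Attention's per-device loop: each KV chunk plays the role
    of a block received from a ring neighbor. q, k, v: (seq, d)."""
    d = q.shape[-1]
    acc = torch.zeros_like(q)                       # running weighted sum of V
    m = torch.full((q.shape[0], 1), -float("inf"))  # running max (stability)
    l = torch.zeros(q.shape[0], 1)                  # running softmax denominator
    for k_blk, v_blk in zip(k.chunk(n_chunks), v.chunk(n_chunks)):
        s = (q @ k_blk.T) / d**0.5                  # scores vs this chunk
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        scale = (m - m_new).exp()                   # rescale earlier partials
        p = (s - m_new).exp()
        acc = acc * scale + p @ v_blk
        l = l * scale + p.sum(dim=-1, keepdim=True)
        m = m_new
    return acc / l

q, k, v = (torch.randn(64, 32) for _ in range(3))
out = ring_attention_sim(q, k, v)
ref = torch.softmax(q @ k.T / 32**0.5, dim=-1) @ v  # full attention
assert torch.allclose(out, ref, atol=1e-4)
```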
D. Architecture Search and Efficient Architectures
Efficient Transformer Variants
| Method | Core Idea | Complexity | Representative Models |
|---|---|---|---|
| Standard Attention | Full self-attention | \(O(n^2)\) | GPT, BERT |
| Sparse Attention | Local/fixed patterns | \(O(n)\) to \(O(n\sqrt{n})\) | Longformer, BigBird |
| Linear Attention | Kernel approximation | \(O(n)\) | Performer, RWKV |
| Flash Attention | IO-aware exact attention | \(O(n^2)\) FLOPs, far fewer HBM accesses | FlashAttention-2/3 |
| State Space Models | Recurrent state updates | \(O(n)\) | Mamba, S4 |
Flash Attention does not change the mathematical form of attention; instead, it drastically reduces HBM accesses through tiling and kernel fusion. This is a systems-level algorithmic optimization.
Mamba (Gu & Dao, 2023):
- Selective State Space Model (Selective SSM)
- Input-dependent selection mechanism that resolves the inability of traditional SSMs to perform content-aware processing
- Linear time complexity with significant advantages on long-sequence tasks
- Jamba = Mamba + MoE, combining the strengths of both
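A heavily simplified sketch of the selective recurrence (it omits Mamba's input-dependent step size \(\Delta\), the discretization, and the hardware-aware parallel scan; shapes and projections are illustrative):

```python
import torch

def selective_ssm_scan(x, A, B_proj, C_proj):
    """Selective SSM recurrence, simplified: h_t = exp(A)*h_{t-1} + x_t B_t,
    y_t = h_t C_t, where B_t and C_t are computed *from the input* x_t --
    the 'selection' that lets the state decide what to remember.
    x: (seq, d); A: (d_state,) with negative entries (decaying state)."""
    h = torch.zeros(x.shape[1], A.shape[0])          # state: (d, d_state)
    ys = []
    for t in range(x.shape[0]):
        B_t, C_t = B_proj(x[t]), C_proj(x[t])        # input-dependent matrices
        h = h * torch.exp(A) + x[t, :, None] * B_t[None, :]
        ys.append(h @ C_t)                           # read out: (d,)
    return torch.stack(ys)                           # (seq, d)

d, d_state = 16, 8
y = selective_ssm_scan(torch.randn(32, d), -torch.rand(d_state),
                       torch.nn.Linear(d, d_state), torch.nn.Linear(d, d_state))
```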
Neural Architecture Search (NAS) in the Age of Large Models
Traditional NAS is constrained in the large-model era (search spaces are too vast), but certain ideas remain valuable:
- Scaling-aware NAS: Search at small scale, then use Scaling Laws to predict large-scale performance (see the curve-fitting sketch after this list)
- Modular search: Fix the overall architecture and search over local modules (e.g., FFN width ratio, number of attention heads)
- Compound Scaling: Jointly optimize depth, width, and resolution (extending the EfficientNet philosophy to LLMs)
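As a sketch of the first idea, one can fit a saturating power law to a handful of small-scale runs and extrapolate to the target scale before committing compute (all numbers below are made up for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical small-scale runs: (non-embedding params, final loss).
n = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
loss = np.array([4.10, 3.78, 3.45, 3.20, 2.98])   # made-up measurements

def power_law(n, a, alpha, e_inf):
    """L(N) = a * N^(-alpha) + E_inf, with E_inf the irreducible loss."""
    return a * n ** (-alpha) + e_inf

(a, alpha, e_inf), _ = curve_fit(power_law, n, loss, p0=(10.0, 0.1, 1.5))
print(f"predicted loss at 70B params: {power_law(70e9, a, alpha, e_inf):.3f}")
```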