Knowledge Distillation
Overview
Knowledge Distillation transfers knowledge from a large model (teacher) to a smaller model (student), playing a critical role in model compression and deployment optimization. From Hinton's classic method to LLM-era distillation strategies, this chapter systematically covers the theory and practice of knowledge distillation.
1. Classic Knowledge Distillation
1.1 Hinton Distillation (2015)
Core Idea: The teacher model's soft labels contain "dark knowledge" about inter-class relationships.
Distillation Loss:
\[
\mathcal{L}_{\text{KD}} = \alpha \cdot T^2 \cdot \text{KL}\big(\sigma(z_t/T) \,\|\, \sigma(z_s/T)\big) + (1-\alpha) \cdot \text{CE}\big(y, \sigma(z_s)\big)
\]
where:
- \(z_s, z_t\): Student and teacher logits
- \(\sigma\): Softmax function
- \(T\): Temperature parameter (typically \(T=3\sim20\))
- \(\alpha\): Weight between hard and soft label losses
- \(T^2\) factor: Compensates for the gradient scaling effect of temperature
Role of Temperature:
- \(T=1\): Standard softmax
- \(T \to \infty\): Uniform distribution
- Higher \(T\) produces smoother distributions, revealing more inter-class relationships
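As a concrete illustration, here is a minimal PyTorch sketch of the loss above; the function name and the default values (\(T=4\), \(\alpha=0.7\)) are illustrative choices, not values prescribed by the original paper.

```python
import torch
import torch.nn.functional as F

def hinton_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Hinton-style distillation: weighted soft-label KL + hard-label CE.

    The T**2 factor keeps the soft-label gradients on the same scale as the
    hard-label term when the temperature changes.
    """
    # Soft targets from the teacher, softened by temperature T
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL(teacher || student), averaged over the batch
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    # Standard cross-entropy against the ground-truth hard labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Example: batch of 8 samples, 10 classes
s = torch.randn(8, 10)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
loss = hinton_kd_loss(s, t, y)
```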
1.2 Why Are Soft Labels Effective?
A teacher output of \([0.7, 0.2, 0.1]\) contains more information than the hard label \([1, 0, 0]\):
- The teacher judges the input to be closer to class 2 than to class 3, even though class 1 is correct
- This similarity structure is valuable knowledge
- The student learns not just the "correct answer" but relative relationships between classes
2. Feature Distillation
2.1 FitNets
FitNets (Romero et al., 2015): Distill intermediate layer features, not just outputs.
\[
\mathcal{L}_{\text{hint}} = \big\| F_t - W_s F_s \big\|_2^2
\]
where \(F_s, F_t\) are student and teacher intermediate features, and \(W_s\) is a projection matrix for dimension matching.
Advantage: Students learn better intermediate representations.
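A minimal PyTorch sketch of the hint loss, assuming pooled (vector) features and a learned linear projection for \(W_s\); FitNets itself applies a convolutional regressor to feature maps, so this only captures the shape of the idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FitNetsHint(nn.Module):
    """Hint loss: project student features to the teacher's width, then MSE."""

    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        # W_s: learned projection for dimension matching
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, f_student, f_teacher):
        # f_student: (batch, student_dim), f_teacher: (batch, teacher_dim)
        return F.mse_loss(self.proj(f_student), f_teacher)

hint = FitNetsHint(student_dim=256, teacher_dim=768)
loss = hint(torch.randn(4, 256), torch.randn(4, 768))
```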
2.2 Attention Distillation
Attention Transfer (Zagoruyko & Komodakis, 2017): Distill attention maps.
\[
\mathcal{L}_{\text{AT}} = \sum_l \left\| \frac{A_s^l}{\|A_s^l\|_2} - \frac{A_t^l}{\|A_t^l\|_2} \right\|_2
\]
where the attention map \(A^l = \sum_c |F^l_c|^2\) (activations squared and summed along the channel dimension).
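A sketch of how the attention maps and the transfer loss might be computed in PyTorch; the L2 normalization follows the paper, while the mean-squared-difference reduction is a simplification for readability.

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    # feat: (batch, channels, H, W) -> (batch, H*W), A = sum_c |F_c|^2
    a = feat.pow(2).sum(dim=1).flatten(1)
    # L2-normalize each map so scale differences between networks cancel
    return F.normalize(a, dim=1)

def at_loss(f_student, f_teacher):
    # Student and teacher feature maps must share spatial size; channel counts may differ
    return (attention_map(f_student) - attention_map(f_teacher)).pow(2).mean()

loss = at_loss(torch.randn(2, 64, 8, 8), torch.randn(2, 256, 8, 8))
```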
2.3 Relational Distillation
RKD (Relational Knowledge Distillation): Distill inter-sample relationships.
\[
\mathcal{L}_{\text{RKD}} = \sum_{(x_i, x_j)} \ell\big(\psi(t_i, t_j), \psi(s_i, s_j)\big)
\]
where \(\psi\) can represent distance or angular relationships between sample embeddings, and \(\ell\) is typically the Huber loss.
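A sketch of the distance-wise variant in PyTorch, assuming a batch of embeddings from each network; the mean-distance normalization and the Huber (smooth L1) loss follow the RKD formulation.

```python
import torch
import torch.nn.functional as F

def rkd_distance_loss(emb_student, emb_teacher):
    """Distance-wise RKD: match normalized pairwise distances between samples."""
    def pairwise(e):
        d = torch.cdist(e, e, p=2)                   # (N, N) pairwise distances
        mask = ~torch.eye(len(e), dtype=torch.bool)  # ignore the diagonal
        return d / d[mask].mean()                    # normalize by the mean distance
    # Huber loss between the two relational structures
    return F.smooth_l1_loss(pairwise(emb_student), pairwise(emb_teacher))

loss = rkd_distance_loss(torch.randn(16, 128), torch.randn(16, 512))
```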
2.4 Distillation Methods Summary
| Method | What Is Distilled | Level |
|---|---|---|
| Hinton KD | Soft logits | Output layer |
| FitNets | Intermediate features | Hidden layers |
| Attention Transfer | Attention maps | Attention layers |
| RKD | Inter-sample relations | Representation space |
| CRD | Contrastive representations | Representation space |
| PKD | Patient distillation (multi-layer) | Multiple layers |
3. Knowledge Distillation in NLP
3.1 DistilBERT
DistilBERT (Sanh et al., 2019): A distilled version of BERT.
Design:
- 6 layers (original BERT has 12), 40% fewer parameters
- Distillation loss = soft labels + hard labels + cosine similarity
Result: Retains 97% of BERT's performance, 60% faster.
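A rough PyTorch sketch of the three-part objective, assuming already-flattened token logits and hidden states (DistilBERT keeps the teacher's hidden size, so no projection is needed); equal weights are used here for simplicity, whereas the actual training tunes the three coefficients.

```python
import torch
import torch.nn.functional as F

def distilbert_loss(s_logits, t_logits, labels, s_hidden, t_hidden, T=2.0):
    # s_logits/t_logits: (N, vocab); s_hidden/t_hidden: (N, hidden); labels: (N,)
    soft = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T          # soft-label distillation loss
    hard = F.cross_entropy(s_logits, labels)                # hard-label MLM loss
    target = torch.ones(s_hidden.size(0))
    cos = F.cosine_embedding_loss(s_hidden, t_hidden, target)  # align hidden-state directions
    return soft + hard + cos                                # equal weights for simplicity
```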
3.2 TinyBERT
TinyBERT (Jiao et al., 2020): More comprehensive BERT distillation.
Four-Layer Distillation:
- Embedding layer: \(\mathcal{L}_{\text{emb}} = \text{MSE}(E_s W_e, E_t)\)
- Attention layer: \(\mathcal{L}_{\text{attn}} = \text{MSE}(A_s, A_t)\)
- Hidden layer: \(\mathcal{L}_{\text{hid}} = \text{MSE}(H_s W_h, H_t)\)
- Prediction layer: \(\mathcal{L}_{\text{pred}} = \text{KD}\)
Two Stages: General distillation (pretraining phase) + task distillation (fine-tuning phase)
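A sketch of the layer-wise losses in PyTorch; \(W_e\) and \(W_h\) are learned linear projections from the narrower student to the teacher's hidden size, and the choice of which teacher layer each student layer imitates (which TinyBERT also has to specify) is omitted here.

```python
import torch.nn as nn
import torch.nn.functional as F

class TinyBertLayerLoss(nn.Module):
    """Layer-wise TinyBERT-style losses for one matched (student, teacher) layer pair."""

    def __init__(self, d_student, d_teacher):
        super().__init__()
        self.W_e = nn.Linear(d_student, d_teacher, bias=False)  # embedding projection
        self.W_h = nn.Linear(d_student, d_teacher, bias=False)  # hidden-state projection

    def forward(self, E_s, E_t, A_s, A_t, H_s, H_t):
        l_emb = F.mse_loss(self.W_e(E_s), E_t)   # embedding layer
        l_attn = F.mse_loss(A_s, A_t)            # attention matrices (assumes equal head count)
        l_hid = F.mse_loss(self.W_h(H_s), H_t)   # hidden states
        return l_emb + l_attn + l_hid
```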
3.3 MiniLM
MiniLM (Wang et al., 2020): Distills the teacher's self-attention distributions (Q-K relations) and value relations (V-V scaled dot-products), rather than hidden states directly.
4. Knowledge Distillation in the LLM Era
4.1 LLM Distillation Challenges
| Challenge | Description |
|---|---|
| Scale gap | Teachers are often 100B+, students typically 1-7B |
| White-box/Black-box | Many API models don't expose logits |
| Task diversity | LLMs are general-purpose, not single-task |
| Emergent abilities | Small models may not reproduce emergent behaviors |
4.2 White-Box Distillation
When teacher logits are accessible:
Standard Approach (forward KL):
\[
\mathcal{L}_{\text{FKL}} = \text{KL}\big(p_t \,\|\, p_s\big) = \mathbb{E}_{y \sim p_t}\left[\log \frac{p_t(y \mid x)}{p_s(y \mid x)}\right]
\]
MiniLLM: Replaces forward KL with reverse KL:
\[
\mathcal{L}_{\text{RKL}} = \text{KL}\big(p_s \,\|\, p_t\big) = \mathbb{E}_{y \sim p_s}\left[\log \frac{p_s(y \mid x)}{p_t(y \mid x)}\right]
\]
Reverse KL encourages the student to concentrate on high-probability regions of the teacher (mode-seeking), avoiding over-dispersion.
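For per-token distributions, the two objectives can be compared with a short PyTorch sketch. Note that MiniLLM optimizes the reverse KL over sampled sequences with a policy-gradient-style procedure, so this token-level version is only illustrative.

```python
import torch.nn.functional as F

def forward_kl(s_logits, t_logits):
    # KL(p_t || p_s): expectation under the teacher (mean-covering), standard KD
    p_t = F.softmax(t_logits, dim=-1)
    return F.kl_div(F.log_softmax(s_logits, dim=-1), p_t, reduction="batchmean")

def reverse_kl(s_logits, t_logits):
    # KL(p_s || p_t): expectation under the student (mode-seeking), as in MiniLLM
    p_s = F.softmax(s_logits, dim=-1)
    return F.kl_div(F.log_softmax(t_logits, dim=-1), p_s, reduction="batchmean")
```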
4.3 Black-Box Distillation
When only the teacher's text outputs are available:
- Data generation: Use teacher to generate high-quality training data
- Alpaca/Vicuna approach: Use GPT-4 to generate instruction data for training smaller models
- Self-Instruct: Use teacher to automatically generate instruction-response pairs
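The data-generation loop itself is simple. The sketch below assumes a hypothetical `query_teacher` callable wrapping whatever text-in/text-out API is available; it is not tied to any specific provider.

```python
from typing import Callable

def build_distillation_set(instructions: list[str],
                           query_teacher: Callable[[str], str]) -> list[dict]:
    """Collect (instruction, response) pairs generated by a black-box teacher."""
    dataset = []
    for prompt in instructions:
        response = query_teacher(prompt)  # black-box call: text in, text out
        dataset.append({"instruction": prompt, "output": response})
    return dataset

# The resulting dataset is then used for ordinary supervised fine-tuning of the
# student, as in the Alpaca / Vicuna / Self-Instruct pipelines.
```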
4.4 Representative Works
| Method | Teacher | Student | Strategy |
|---|---|---|---|
| Alpaca | text-davinci-003 | LLaMA-7B | Black-box data distillation |
| Vicuna | GPT-4 | LLaMA-13B | Black-box dialogue distillation |
| Orca | GPT-4 | LLaMA-13B | Explanation-augmented distillation |
| MiniLLM | GPT-2 XL | GPT-2 | White-box reverse KL |
| GKD | PaLM | - | Online distillation |
5. Self-Distillation
5.1 Concept
The model serves as its own teacher:
- Born-Again Networks: Train multiple generations, each using the previous as teacher
- DINO/DINOv2: EMA teacher + student self-distillation
- Noisy Student: Self-training + noise
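For the EMA-teacher variant (DINO-style), the teacher update is essentially one line; the momentum value below is illustrative.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """DINO-style self-distillation: the teacher is an exponential moving
    average of the student's weights, updated after every optimizer step."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# The teacher starts as a frozen copy of the student, e.g.:
#   teacher = copy.deepcopy(student)
#   for p in teacher.parameters(): p.requires_grad_(False)
```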
5.2 Self-Distillation in LLMs
- Self-Improve: Model generates data to train itself
- STaR: Self-Taught Reasoner
- Constitutional AI: Self-critique and improvement
6. Practical Guide
6.1 Distillation Hyperparameters
| Hyperparameter | Recommended | Notes |
|---|---|---|
| Temperature \(T\) | 3-20 | Higher = smoother |
| \(\alpha\) | 0.5-0.9 | Soft label weight |
| Student depth | 1/2-2/3 of teacher | Too shallow hurts quality |
| Student width | 1/2-3/4 of teacher | Width matters more than depth |
6.2 Practical Insights
- Teacher-student gap should not be too large (otherwise hard to learn)
- Multi-step distillation (Teacher → Teacher Assistant → Student) can outperform distilling directly across a large gap
- Data quality matters more than quantity
- Feature distillation often outperforms logits distillation
7. Summary
```mermaid
graph TD
    A[Knowledge Distillation] --> B[Logits Distillation]
    A --> C[Feature Distillation]
    A --> D[Relational Distillation]
    A --> E[Data Distillation]
    B --> B1[Hinton KD]
    B --> B2[MiniLLM]
    C --> C1[FitNets]
    C --> C2[TinyBERT]
    D --> D1[RKD]
    D --> D2[CRD]
    E --> E1[Alpaca]
    E --> E2[Vicuna]
```
Key Takeaways:
- Classic distillation leverages dark knowledge in soft labels
- Feature distillation transfers intermediate representations
- LLM distillation faces scale and black-box challenges
- Black-box distillation (data generation) is mainstream in the LLM era
- Self-distillation is a growing research direction
References
- Hinton et al., "Distilling the Knowledge in a Neural Network," 2015
- Romero et al., "FitNets: Hints for Thin Deep Nets," ICLR 2015
- Sanh et al., "DistilBERT, a distilled version of BERT," NeurIPS Workshop 2019
- Jiao et al., "TinyBERT: Distilling BERT for Natural Language Understanding," EMNLP 2020
- Gu et al., "MiniLLM: Knowledge Distillation of Large Language Models," ICLR 2024