
Model Compression Overview

Why Model Compression Matters

Modern deep learning models have grown explosively in parameter count and computational cost. GPT-3, for example, has 175 billion parameters and demands enormous compute for a single inference pass. In real-world deployment, we face strict constraints:

  • Latency constraints: Real-time inference scenarios (autonomous driving, voice assistants) require millisecond-level responses
  • Memory constraints: Edge devices (smartphones, IoT) typically have only hundreds of MB to a few GB of memory
  • Power constraints: Mobile and embedded devices are extremely power-sensitive
  • Cost constraints: Cloud GPU inference costs scale linearly with model size

The goal of model compression is to reduce model size, lower computational cost, and accelerate inference while preserving accuracy as much as possible.


Knowledge Distillation

Core Idea

Knowledge distillation, proposed by Hinton et al. in 2015, transfers the "knowledge" of a large model (Teacher) into a small model (Student). The Teacher's soft labels encode inter-class similarity information, making them far more informative than one-hot hard labels.

Teacher-Student Framework

Given the Teacher's logits \(z_T\) and the Student's logits \(z_S\), the distillation loss is defined as:

\[ \mathcal{L}_{\text{KD}} = \alpha \cdot T^2 \cdot D_{KL}\!\Big(\sigma\!\big(\frac{z_T}{T}\big) \;\Big\|\; \sigma\!\big(\frac{z_S}{T}\big)\Big) + (1 - \alpha) \cdot \mathcal{L}_{\text{CE}}(y, \sigma(z_S)) \]

where:

  • \(T\) is the temperature parameter; when \(T > 1\), the softmax output is smoother, revealing more inter-class relationships
  • \(\alpha\) balances the distillation loss and the hard-label loss
  • \(\sigma\) denotes the softmax function
  • \(\mathcal{L}_{\text{CE}}\) is the standard cross-entropy loss
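As a concrete illustration, here is a minimal PyTorch sketch of this loss (the function name and the default values of \(T\) and \(\alpha\) are illustrative):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft-target term: KL divergence between temperature-scaled distributions,
    # multiplied by T^2 so gradient magnitudes stay comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

Note that `F.kl_div` expects log-probabilities as its first argument and probabilities as its target, which yields exactly \(D_{KL}\big(\sigma(z_T/T)\,\|\,\sigma(z_S/T)\big)\).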

Case Study: DistilBERT

DistilBERT compresses BERT-base (110M parameters) into a 6-layer Student model (66M parameters), retaining 97% of language understanding capability while achieving a 60% speedup in inference. Its distillation strategy includes:

  1. Soft-target distillation: Matching the Teacher's output distribution via a KL-divergence loss on temperature-scaled logits
  2. Hidden-state distillation: Aligning the directions of Teacher and Student hidden representations with a cosine embedding loss (see the sketch below)
  3. MLM loss: Retaining the masked language modeling objective
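The hidden-state term can be sketched as a cosine embedding loss. The snippet below is an illustration that assumes the hidden states have already been flattened to (num_tokens, hidden_dim), which is possible because DistilBERT keeps the Teacher's hidden size:

```python
import torch
import torch.nn.functional as F

def hidden_state_cosine_loss(student_hidden: torch.Tensor, teacher_hidden: torch.Tensor):
    # Push each Student hidden vector toward the direction of the Teacher's vector.
    target = torch.ones(student_hidden.size(0), device=student_hidden.device)
    return F.cosine_embedding_loss(student_hidden, teacher_hidden, target)
```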

Pruning

Unstructured Pruning

Unstructured pruning operates at the individual weight granularity, setting weights with small absolute values to zero:

\[ w_{ij} = \begin{cases} w_{ij}, & \text{if } |w_{ij}| \geq \theta \\ 0, & \text{if } |w_{ij}| < \theta \end{cases} \]

where \(\theta\) is the pruning threshold. This approach can achieve very high sparsity ratios (e.g., 90%+), but the resulting irregular sparse matrices require specialized hardware or sparse computation libraries to actually achieve speedups.

Representative methods: Magnitude Pruning, Movement Pruning (for the fine-tuning stage)
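A minimal sketch of magnitude pruning on a single weight tensor (the function name and default sparsity are illustrative); PyTorch's built-in `torch.nn.utils.prune.l1_unstructured` applies the same idea per module:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
    """Return a copy of `weight` with the smallest-magnitude entries zeroed out."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values  # theta in the formula above
    mask = (weight.abs() > threshold).to(weight.dtype)
    return weight * mask
```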

Structured Pruning

Structured pruning removes entire channels, attention heads, or layers. The pruned model can be directly accelerated on standard hardware without specialized support.

Common importance criteria:

  • \(\ell_1\)-norm: Remove convolutional filters / channels with the smallest \(\ell_1\) norm (sketched after this list)
  • Taylor expansion: Evaluate importance via first- or second-order Taylor expansion of the loss function with respect to parameters
  • BN scaling factor: Use the \(\gamma\) parameter from Batch Normalization layers as a channel importance indicator
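As an example, the \(\ell_1\)-norm criterion can be sketched as follows (function names are illustrative); after selecting the channels to keep, the convolution and every layer consuming its output must be rebuilt with the reduced channel count:

```python
import torch
import torch.nn as nn

def l1_channel_scores(conv: nn.Conv2d) -> torch.Tensor:
    # Weight shape: (out_channels, in_channels, kH, kW); sum |w| over all dims but 0.
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def channels_to_keep(conv: nn.Conv2d, prune_ratio: float = 0.3) -> torch.Tensor:
    # Keep the (1 - prune_ratio) fraction of output channels with the largest L1 norm.
    scores = l1_channel_scores(conv)
    n_keep = max(1, int(conv.out_channels * (1.0 - prune_ratio)))
    return torch.topk(scores, n_keep).indices.sort().values
```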

The Lottery Ticket Hypothesis

Frankle & Carbin (2019) proposed that a randomly initialized dense network contains a sparse subnetwork (a "winning ticket") which, when trained from the same initialization, can match the full network's accuracy in comparable or fewer iterations.

Core procedure:

  1. Randomly initialize the network; record initial weights \(w_0\)
  2. Train the network to convergence, obtaining \(w_f\)
  3. Prune \(p\%\) of weights by magnitude, producing mask \(m\)
  4. Reset the remaining weights to \(w_0\) and retrain using \(m \odot w_0\)
  5. Iterate (Iterative Magnitude Pruning, IMP)

This finding has profound implications for understanding redundancy and generalization in neural networks.
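A schematic of the IMP loop described above, assuming a user-supplied `train_fn(model, masks)` that trains the masked network to convergence (all names here are illustrative):

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, rounds=5, prune_frac=0.2):
    w0 = copy.deepcopy(model.state_dict())                      # step 1: record initial weights
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}
    for _ in range(rounds):                                     # step 5: iterate
        train_fn(model, masks)                                  # step 2: train to convergence
        for name, p in model.named_parameters():
            if name not in masks:
                continue                                        # leave biases etc. untouched
            alive = p.detach().abs()[masks[name].bool()]        # still-unpruned weights
            k = int(alive.numel() * prune_frac)
            if k == 0:
                continue
            threshold = alive.kthvalue(k).values                # step 3: prune p% by magnitude
            masks[name] *= (p.detach().abs() > threshold).to(p.dtype)
        model.load_state_dict(w0)                               # step 4: rewind to w_0
    return masks
```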


Quantization

Quantization compresses models by reducing the numerical precision of weights and activations, typically converting FP32 to INT8 or even INT4.

Uniform Quantization Formula

\[ x_q = \text{round}\!\Big(\frac{x}{S}\Big) + Z, \quad S = \frac{x_{\max} - x_{\min}}{2^b - 1}, \quad Z = \text{round}\!\Big(-\frac{x_{\min}}{S}\Big) \]

where \(S\) is the scale factor, \(Z\) is the zero-point, and \(b\) is the target bit-width.
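A minimal sketch of this asymmetric uniform quantizer, together with the corresponding dequantization step (edge cases such as a constant tensor are ignored here):

```python
import torch

def uniform_quantize(x: torch.Tensor, bits: int = 8):
    qmax = 2 ** bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / qmax                        # S
    zero_point = torch.round(-x_min / scale)              # Z, so that x_min maps to 0
    x_q = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
    x_dq = (x_q - zero_point) * scale                     # dequantized approximation of x
    return x_q.to(torch.uint8), x_dq
```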

Two Major Paradigms

Paradigm | Full Name                   | Requires Training | Accuracy Retention | Representative Tools
---------|-----------------------------|-------------------|--------------------|---------------------
PTQ      | Post-Training Quantization  | No                | Moderate           | TensorRT, GPTQ, AWQ
QAT      | Quantization-Aware Training | Yes               | High               | PyTorch QAT, LSQ

PTQ requires only a small calibration dataset and is suitable for rapid deployment; QAT simulates quantization error during training for better accuracy retention but at higher cost.
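To make the QAT idea concrete, one way to sketch a fake-quantization operator with a Straight-Through Estimator in PyTorch is shown below (a simplified illustration; production QAT uses the framework's observer and fake-quantize modules):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Quantize-dequantize to INT8 in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, x, scale, zero_point):
        x_q = torch.clamp(torch.round(x / scale) + zero_point, 0, 255)
        return (x_q - zero_point) * scale     # dequantize so downstream ops stay in FP32

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-Through Estimator: treat round() and clamp() as identity w.r.t. x;
        # scale and zero_point receive no gradient in this simplified sketch.
        return grad_output, None, None

# Usage: x_fq = FakeQuantSTE.apply(x, scale, zero_point)
```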

For a detailed treatment of quantization (including PTQ calibration strategies, the Straight-Through Estimator in QAT, mixed-precision quantization, etc.), see Quantization.


Low-Rank Factorization

Low-rank factorization uses matrix decomposition to approximate a large weight matrix as a product of smaller matrices, thereby reducing both parameter count and computation.

For a weight matrix \(W \in \mathbb{R}^{m \times n}\), a rank-\(r\) approximation is:

\[ W \approx U V, \quad U \in \mathbb{R}^{m \times r}, \; V \in \mathbb{R}^{r \times n}, \; r \ll \min(m, n) \]

Parameter reduction: From \(mn\) down to \(r(m+n)\). The compression is significant when \(r \ll \frac{mn}{m+n}\).
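As a quick sketch, such a rank-\(r\) factorization of a pretrained weight matrix can be obtained by truncating its SVD (the function name is illustrative):

```python
import torch

def low_rank_factorize(W: torch.Tensor, r: int):
    # Truncated SVD: keep only the top-r singular values/vectors.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :r] * S[:r]       # fold the singular values into the left factor
    V_r = Vh[:r, :]
    return U_r, V_r              # W ≈ U_r @ V_r, with r*(m+n) parameters instead of m*n
```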

Common Methods

  • SVD decomposition: Perform singular value decomposition \(W = U \Sigma V^T\) on pretrained weights; retain the top \(r\) singular values
  • Tucker decomposition: Higher-order tensor decomposition for convolutional kernels
  • LoRA: During fine-tuning, freeze the original weights \(W\) and learn only a low-rank update \(\Delta W = BA\), where \(B \in \mathbb{R}^{m \times r}\), \(A \in \mathbb{R}^{r \times n}\)

Note: Strictly speaking, LoRA is a parameter-efficient fine-tuning (PEFT) method rather than a compression technique, but its underlying principle is exactly low-rank factorization.
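A simplified LoRA layer might look like the sketch below (the zero initialization of \(B\) makes \(\Delta W = BA\) start at zero; the rank and scaling values are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # freeze the original W
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))     # Delta W = B @ A starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```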


Method Comparison

Method                 | Compression Ratio | Accuracy Loss | Retraining Required | Hardware Friendly     | Typical Use Case
-----------------------|-------------------|---------------|---------------------|-----------------------|----------------------------------
Knowledge Distillation | 2x-10x            | Low           | Yes (train Student) | High                  | Deploying lightweight models
Unstructured Pruning   | 10x-100x          | Low-Medium    | Usually fine-tuning | Low (needs sparse HW) | Research, specialized accelerators
Structured Pruning     | 2x-5x             | Low-Medium    | Usually fine-tuning | High                  | General inference acceleration
PTQ                    | 2x-4x             | Low-Medium    | No                  | High                  | Rapid deployment
QAT                    | 2x-4x             | Low           | Yes                 | High                  | Accuracy-sensitive scenarios
Low-Rank Factorization | 2x-5x             | Medium        | Optional            | High                  | FC / attention layer compression

Combining Techniques

In practice, multiple compression techniques are often used together:

  1. First, use knowledge distillation to obtain a smaller Student model
  2. Apply structured pruning to the Student to remove redundant channels
  3. Finally, use PTQ to convert the model to INT8 for deployment

This pipeline approach can achieve 10x-50x overall compression ratios while maintaining accuracy.


References

  • Hinton et al., Distilling the Knowledge in a Neural Network, 2015
  • Han et al., Deep Compression, ICLR 2016
  • Frankle & Carbin, The Lottery Ticket Hypothesis, ICLR 2019
  • Sanh et al., DistilBERT, NeurIPS Workshop 2019
  • Hu et al., LoRA: Low-Rank Adaptation of Large Language Models, ICLR 2022
