
Model Compression Overview

Why Model Compression Matters

Modern deep learning models have grown explosively in parameter count and computational cost. GPT-3, for example, has 175 billion parameters and demands enormous compute for a single inference pass. In real-world deployment, we face strict constraints:

  • Latency constraints: Real-time inference scenarios (autonomous driving, voice assistants) require millisecond-level responses
  • Memory constraints: Edge devices (smartphones, IoT) typically have only hundreds of MB to a few GB of memory
  • Power constraints: Mobile and embedded devices are extremely power-sensitive
  • Cost constraints: Cloud GPU inference costs scale linearly with model size

The goal of model compression is to reduce model size, lower computational cost, and accelerate inference while preserving accuracy as much as possible.


Knowledge Distillation

Core Idea

Knowledge distillation, proposed by Hinton et al. in 2015, transfers the "knowledge" of a large model (Teacher) into a small model (Student). The Teacher's soft labels encode inter-class similarity information, making them far more informative than one-hot hard labels.

Teacher-Student Framework

Given the Teacher's logits \(z_T\) and the Student's logits \(z_S\), the distillation loss is defined as:

\[ \mathcal{L}_{\text{KD}} = \alpha \cdot T^2 \cdot D_{KL}\!\Big(\sigma\!\big(\frac{z_T}{T}\big) \;\Big\|\; \sigma\!\big(\frac{z_S}{T}\big)\Big) + (1 - \alpha) \cdot \mathcal{L}_{\text{CE}}(y, \sigma(z_S)) \]

where:

  • \(T\) is the temperature parameter; when \(T > 1\), the softmax output is smoother, revealing more inter-class relationships
  • \(\alpha\) balances the distillation loss and the hard-label loss
  • \(\sigma\) denotes the softmax function
  • \(\mathcal{L}_{\text{CE}}\) is the standard cross-entropy loss
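As a concrete illustration, here is a minimal PyTorch sketch of this loss (the function name and the default values of \(T\) and \(\alpha\) are illustrative):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft-target term: KL divergence between temperature-scaled distributions,
    # multiplied by T^2 so gradient magnitudes stay comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

Note that `F.kl_div` expects log-probabilities as its first argument and probabilities as its target, which yields exactly \(D_{KL}\big(\sigma(z_T/T)\,\|\,\sigma(z_S/T)\big)\).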

Case Study: DistilBERT

DistilBERT compresses BERT-base (110M parameters) into a 6-layer Student model (66M parameters), retaining 97% of language understanding capability while achieving a 60% speedup in inference. Its distillation strategy includes:

  1. Soft-target distillation: Matching the Teacher's output distribution via a KL-divergence loss on temperature-scaled logits
  2. Hidden-state distillation: Aligning the directions of Teacher and Student hidden representations with a cosine embedding loss (see the sketch below)
  3. MLM loss: Retaining the masked language modeling objective
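The hidden-state term can be sketched as a cosine embedding loss. The snippet below is an illustration that assumes the hidden states have already been flattened to (num_tokens, hidden_dim), which is possible because DistilBERT keeps the Teacher's hidden size:

```python
import torch
import torch.nn.functional as F

def hidden_state_cosine_loss(student_hidden: torch.Tensor, teacher_hidden: torch.Tensor):
    # Push each Student hidden vector toward the direction of the Teacher's vector.
    target = torch.ones(student_hidden.size(0), device=student_hidden.device)
    return F.cosine_embedding_loss(student_hidden, teacher_hidden, target)
```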

Pruning

Unstructured Pruning

Unstructured pruning operates at the individual weight granularity, setting weights with small absolute values to zero:

\[ w_{ij} = \begin{cases} w_{ij}, & \text{if } |w_{ij}| \geq \theta \\ 0, & \text{if } |w_{ij}| < \theta \end{cases} \]

where \(\theta\) is the pruning threshold. This approach can achieve very high sparsity ratios (e.g., 90%+), but the resulting irregular sparse matrices require specialized hardware or sparse computation libraries to actually achieve speedups.

Representative methods: Magnitude Pruning, Movement Pruning (for the fine-tuning stage)
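A minimal sketch of magnitude pruning on a single weight tensor (the function name and default sparsity are illustrative); PyTorch's built-in `torch.nn.utils.prune.l1_unstructured` applies the same idea per module:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
    """Return a copy of `weight` with the smallest-magnitude entries zeroed out."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values  # theta in the formula above
    mask = (weight.abs() > threshold).to(weight.dtype)
    return weight * mask
```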

Structured Pruning

Structured pruning removes entire channels, attention heads, or layers. The pruned model can be directly accelerated on standard hardware without specialized support.

Common importance criteria:

  • \(\ell_1\)-norm: Remove convolutional filters / channels with the smallest \(\ell_1\) norm (sketched after this list)
  • Taylor expansion: Evaluate importance via first- or second-order Taylor expansion of the loss function with respect to parameters
  • BN scaling factor: Use the \(\gamma\) parameter from Batch Normalization layers as a channel importance indicator
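As an example, the \(\ell_1\)-norm criterion can be sketched as follows (function names are illustrative); after selecting the channels to keep, the convolution and every layer consuming its output must be rebuilt with the reduced channel count:

```python
import torch
import torch.nn as nn

def l1_channel_scores(conv: nn.Conv2d) -> torch.Tensor:
    # Weight shape: (out_channels, in_channels, kH, kW); sum |w| over all dims but 0.
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def channels_to_keep(conv: nn.Conv2d, prune_ratio: float = 0.3) -> torch.Tensor:
    # Keep the (1 - prune_ratio) fraction of output channels with the largest L1 norm.
    scores = l1_channel_scores(conv)
    n_keep = max(1, int(conv.out_channels * (1.0 - prune_ratio)))
    return torch.topk(scores, n_keep).indices.sort().values
```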

The Lottery Ticket Hypothesis

Frankle & Carbin (2019) proposed that a randomly initialized dense network contains a sparse subnetwork (a "winning ticket") which, when trained from the same initialization, can match the full network's accuracy in comparable or fewer iterations.

Core procedure:

  1. Randomly initialize the network; record initial weights \(w_0\)
  2. Train the network to convergence, obtaining \(w_f\)
  3. Prune \(p\%\) of weights by magnitude, producing mask \(m\)
  4. Reset the remaining weights to \(w_0\) and retrain using \(m \odot w_0\)
  5. Iterate (Iterative Magnitude Pruning, IMP)

This finding has profound implications for understanding redundancy and generalization in neural networks.
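A schematic of the IMP loop described above, assuming a user-supplied `train_fn(model, masks)` that trains the masked network to convergence (all names here are illustrative):

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, rounds=5, prune_frac=0.2):
    w0 = copy.deepcopy(model.state_dict())                      # step 1: record initial weights
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}
    for _ in range(rounds):                                     # step 5: iterate
        train_fn(model, masks)                                  # step 2: train to convergence
        for name, p in model.named_parameters():
            if name not in masks:
                continue                                        # leave biases etc. untouched
            alive = p.detach().abs()[masks[name].bool()]        # still-unpruned weights
            k = int(alive.numel() * prune_frac)
            if k == 0:
                continue
            threshold = alive.kthvalue(k).values                # step 3: prune p% by magnitude
            masks[name] *= (p.detach().abs() > threshold).to(p.dtype)
        model.load_state_dict(w0)                               # step 4: rewind to w_0
    return masks
```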


Quantization

Quantization compresses models by reducing the numerical precision of weights and activations, typically converting FP32 to INT8 or even INT4.

Uniform Quantization Formula

\[ x_q = \text{round}\!\Big(\frac{x}{S}\Big) + Z, \quad S = \frac{x_{\max} - x_{\min}}{2^b - 1}, \quad Z = \text{round}\!\Big(-\frac{x_{\min}}{S}\Big) \]

where \(S\) is the scale factor, \(Z\) is the zero-point, and \(b\) is the target bit-width.
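A minimal sketch of this asymmetric uniform quantizer, together with the corresponding dequantization step (edge cases such as a constant tensor are ignored here):

```python
import torch

def uniform_quantize(x: torch.Tensor, bits: int = 8):
    qmax = 2 ** bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / qmax                        # S
    zero_point = torch.round(-x_min / scale)              # Z, so that x_min maps to 0
    x_q = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
    x_dq = (x_q - zero_point) * scale                     # dequantized approximation of x
    return x_q.to(torch.uint8), x_dq
```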

Two Major Paradigms

Paradigm | Full Name                   | Requires Training | Accuracy Retention | Representative Tools
---------|-----------------------------|-------------------|--------------------|---------------------
PTQ      | Post-Training Quantization  | No                | Moderate           | TensorRT, GPTQ, AWQ
QAT      | Quantization-Aware Training | Yes               | High               | PyTorch QAT, LSQ

PTQ requires only a small calibration dataset and is suitable for rapid deployment; QAT simulates quantization error during training for better accuracy retention but at higher cost.
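To make the QAT idea concrete, one way to sketch a fake-quantization operator with a Straight-Through Estimator in PyTorch is shown below (a simplified illustration; production QAT uses the framework's observer and fake-quantize modules):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Quantize-dequantize to INT8 in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, x, scale, zero_point):
        x_q = torch.clamp(torch.round(x / scale) + zero_point, 0, 255)
        return (x_q - zero_point) * scale     # dequantize so downstream ops stay in FP32

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-Through Estimator: treat round() and clamp() as identity w.r.t. x;
        # scale and zero_point receive no gradient in this simplified sketch.
        return grad_output, None, None

# Usage: x_fq = FakeQuantSTE.apply(x, scale, zero_point)
```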

For a detailed treatment of quantization (including PTQ calibration strategies, the Straight-Through Estimator in QAT, mixed-precision quantization, etc.), see Quantization.


Low-Rank Factorization

Low-rank factorization uses matrix decomposition to approximate a large weight matrix as a product of smaller matrices, thereby reducing both parameter count and computation.

For a weight matrix \(W \in \mathbb{R}^{m \times n}\), a rank-\(r\) approximation is:

\[ W \approx U V, \quad U \in \mathbb{R}^{m \times r}, \; V \in \mathbb{R}^{r \times n}, \; r \ll \min(m, n) \]

Parameter reduction: From \(mn\) down to \(r(m+n)\). The compression is significant when \(r \ll \frac{mn}{m+n}\).
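As a quick sketch, such a rank-\(r\) factorization of a pretrained weight matrix can be obtained by truncating its SVD (the function name is illustrative):

```python
import torch

def low_rank_factorize(W: torch.Tensor, r: int):
    # Truncated SVD: keep only the top-r singular values/vectors.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :r] * S[:r]       # fold the singular values into the left factor
    V_r = Vh[:r, :]
    return U_r, V_r              # W ≈ U_r @ V_r, with r*(m+n) parameters instead of m*n
```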

Common Methods

  • SVD decomposition: Perform singular value decomposition \(W = U \Sigma V^T\) on pretrained weights; retain the top \(r\) singular values
  • Tucker decomposition: Higher-order tensor decomposition for convolutional kernels
  • LoRA: During fine-tuning, freeze the original weights \(W\) and learn only a low-rank update \(\Delta W = BA\), where \(B \in \mathbb{R}^{m \times r}\), \(A \in \mathbb{R}^{r \times n}\)

Note: Strictly speaking, LoRA is a parameter-efficient fine-tuning (PEFT) method rather than a compression technique, but its underlying principle is exactly low-rank factorization.
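A simplified LoRA layer might look like the sketch below (the zero initialization of \(B\) makes \(\Delta W = BA\) start at zero; the rank and scaling values are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # freeze the original W
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))     # Delta W = B @ A starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```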


Method Comparison

Method                 | Compression Ratio | Accuracy Loss | Retraining Required | Hardware Friendly     | Typical Use Case
-----------------------|-------------------|---------------|---------------------|-----------------------|----------------------------------
Knowledge Distillation | 2x-10x            | Low           | Yes (train Student) | High                  | Deploying lightweight models
Unstructured Pruning   | 10x-100x          | Low-Medium    | Usually fine-tuning | Low (needs sparse HW) | Research, specialized accelerators
Structured Pruning     | 2x-5x             | Low-Medium    | Usually fine-tuning | High                  | General inference acceleration
PTQ                    | 2x-4x             | Low-Medium    | No                  | High                  | Rapid deployment
QAT                    | 2x-4x             | Low           | Yes                 | High                  | Accuracy-sensitive scenarios
Low-Rank Factorization | 2x-5x             | Medium        | Optional            | High                  | FC / attention layer compression

Combining Techniques

In practice, multiple compression techniques are often used together:

  1. First, use knowledge distillation to obtain a smaller Student model
  2. Apply structured pruning to the Student to remove redundant channels
  3. Finally, use PTQ to convert the model to INT8 for deployment

This pipeline approach can achieve 10x-50x overall compression ratios while maintaining accuracy.


References

  • Hinton et al., Distilling the Knowledge in a Neural Network, 2015
  • Han et al., Deep Compression, ICLR 2016
  • Frankle & Carbin, The Lottery Ticket Hypothesis, ICLR 2019
  • Sanh et al., DistilBERT, NeurIPS Workshop 2019
  • Hu et al., LoRA: Low-Rank Adaptation of Large Language Models, ICLR 2022
