
Model Compression

Common model compression techniques include pruning, knowledge distillation, and quantization (PTQ, QAT).

PTQ Quantization

PTQ (Post-Training Quantization) is the most widely adopted and efficient approach in industry. Its core idea is to keep the trained weights fixed (no retraining) and only find the appropriate scaling factors for mapping them to low-precision integers.

Typical implementations: TensorRT, OpenVINO, TFLite

  • Weight conversion: Weights are static, so we simply collect the min/max statistics of each layer, compute a scale factor \(S\) (e.g. \(S = \max(|w|)/127\) for symmetric INT8), and convert the floating-point values to integers.
  • Activation calibration: Activations are dynamic and depend on the input data. A small set of real data (typically 100–500 samples) is fed through the model to profile the output distribution of each layer and determine the quantization range for activations. A minimal sketch of both steps follows this list.
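To make the two steps concrete, here is a minimal NumPy sketch of symmetric INT8 quantization: the weight scale is computed once from the stored tensor, while the activation scale is profiled from a handful of calibration batches. The function names, the symmetric \(S = \max(|x|)/127\) mapping, and the random data are illustrative assumptions rather than any particular framework's API.

```python
import numpy as np

def symmetric_int8_scale(x):
    """Scale factor S so that x / S fits into [-127, 127] (symmetric INT8)."""
    return max(float(np.abs(x).max()), 1e-8) / 127.0

def quantize_int8(x, scale):
    """Map FP32 values to INT8 integers: q = round(x / S), clipped to range."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

# Weight conversion: weights are static, so one pass over the tensor suffices.
weights = np.random.randn(256, 128).astype(np.float32)
w_scale = symmetric_int8_scale(weights)
w_int8 = quantize_int8(weights, w_scale)

# Activation calibration: feed a few hundred real samples through the model
# and track the observed range (simulated here with random batches).
running_max = 0.0
for _ in range(10):                               # ~10 batches of 32 samples
    activations = np.random.randn(32, 128).astype(np.float32)
    running_max = max(running_max, float(np.abs(activations).max()))
a_scale = max(running_max, 1e-8) / 127.0

# At inference time the integer is mapped back as q * S, an approximation of x.
reconstructed = w_int8.astype(np.float32) * w_scale
print("max weight reconstruction error:", float(np.abs(weights - reconstructed).max()))
```

Real toolchains typically refine this with per-channel weight scales and asymmetric (zero-point) activation ranges, but the basic mapping is the same.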

Common strategies:

  • Min-Max clipping: Directly uses the observed minimum and maximum values, but is susceptible to distortion by outliers.
  • KL divergence (Entropy): Searches for the clipping threshold that minimizes the KL divergence (information loss) between the original and quantized distributions; commonly used by NVIDIA TensorRT. A simplified sketch of this threshold search follows the list.
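As a rough illustration of the entropy approach, the sketch below histograms the absolute activations and, for each candidate clipping threshold, compares the clipped distribution with a version collapsed to 128 quantization levels, keeping the threshold that yields the lowest KL divergence. This is a simplified version of the idea behind TensorRT's entropy calibration, not its exact algorithm; the bin count, level count, and toy data are assumptions.

```python
import numpy as np

def kl_calibration_threshold(activations, num_bins=2048, num_levels=128):
    """Pick a clipping threshold that minimizes KL(original || quantized)."""
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    best_kl, best_threshold = float("inf"), edges[-1]

    for i in range(num_levels, num_bins + 1):
        # Reference distribution: mass above the candidate threshold is
        # clipped into the last kept bin.
        ref = hist[:i].astype(np.float64)
        ref[-1] += hist[i:].sum()

        # Quantized candidate: merge the kept bins into num_levels groups and
        # spread each group's mass evenly so both distributions share support.
        groups = np.array_split(ref, num_levels)
        cand = np.concatenate([np.full(len(g), g.mean()) for g in groups])

        # KL divergence over the non-empty reference bins.
        mask = ref > 0
        p = ref[mask] / ref.sum()
        q = cand[mask] / cand[mask].sum()
        kl = float(np.sum(p * np.log(p / q)))

        if kl < best_kl:
            best_kl, best_threshold = kl, edges[i]

    return best_threshold

# Activations with a long tail: plain min-max would waste range on outliers,
# while the KL search settles on a much tighter clipping threshold.
acts = np.concatenate([np.random.randn(100_000), np.random.randn(10) * 50.0])
print("min-max range :", float(np.abs(acts).max()))
print("KL threshold  :", kl_calibration_threshold(acts))
```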

Pros and cons:

  • Pros: Can be completed in just a few minutes; requires no training code, only an inference framework; and does not need labeled data (a concrete workflow sketch follows this list).
  • Cons: For models that are highly sensitive to numerical distributions, such as MobileNet (depthwise separable convolutions) or Transformers, accuracy can degrade dramatically.
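As one concrete example of how little machinery PTQ needs, the sketch below walks through PyTorch's eager-mode post-training static quantization: observers are inserted, a few unlabeled batches are run for calibration, and the modules are then converted to INT8. The tiny model, the fbgemm backend choice, and the batch counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (QuantStub, DeQuantStub,
                                   get_default_qconfig, prepare, convert)

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # FP32 -> INT8 at the model boundary
        self.fc = nn.Linear(64, 10)
        self.dequant = DeQuantStub()  # INT8 -> FP32 on the way out

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyNet().eval()
model.qconfig = get_default_qconfig("fbgemm")   # x86 server backend
prepared = prepare(model)                        # insert min/max observers

# Calibration: a few unlabeled batches are enough to profile activations.
with torch.no_grad():
    for _ in range(20):
        prepared(torch.randn(32, 64))

quantized = convert(prepared)                    # swap modules for INT8 kernels
print(quantized.fc)                              # now a quantized Linear layer
```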

QAT Quantization

When PTQ cannot preserve acceptable accuracy, QAT (Quantization-Aware Training) becomes necessary. It incorporates quantization simulation into the training process.

Typical implementations: PyTorch QAT, TensorFlow QAT

Core workflow:

  1. Insert fake-quantization nodes (Fake Quant): In the training graph, a quantization "simulator" is inserted before and after every convolutional or fully connected layer.
  2. Forward pass (simulated distortion): Before computation, weights and activations are simulated as INT8 values and then immediately converted back to FP32. This way, the model sees "staircase-like" and "clipped" numerical values during training.
  3. Backward pass (Straight-Through Estimator, STE): This is the key technical challenge. The quantization function is a step function, so its derivative is zero almost everywhere and gradients cannot flow through it. The STE sidesteps this by letting gradients pass straight through the quantization layer as if it were the identity, directly updating the original FP32 weights (a minimal sketch follows this list).
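The fake-quant forward/backward behavior can be sketched in a few lines of PyTorch. The autograd function below rounds and clips to an INT8 grid in the forward pass but lets the gradient pass through unchanged in the backward pass; the class and its parameters are illustrative stand-ins, not PyTorch's built-in FakeQuantize module.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Simulated INT8 quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, x, scale):
        # Quantize to the INT8 grid, then immediately dequantize back to FP32,
        # so downstream layers see "staircase-like" and clipped values.
        q = torch.clamp(torch.round(x / scale), -127, 127)
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat the round/clamp as if it were the identity, so the
        # gradient flows unchanged to the original FP32 tensor.
        return grad_output, None

# Usage inside a training step: the weights stay FP32, but the loss is
# computed through their fake-quantized copies.
w = torch.randn(4, 4, requires_grad=True)
scale = w.detach().abs().max() / 127.0
loss = (FakeQuantSTE.apply(w, scale) ** 2).sum()
loss.backward()
print(w.grad)   # non-zero despite the step function, thanks to the STE
```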

Pros and cons:

  • Pros: The highest accuracy among quantization approaches. Through training, the model learns to compensate for quantization-induced errors and can reach accuracy nearly on par with FP32.
  • Cons: High cost. A full training environment, labeled datasets, and significantly more compute time than PTQ are required.
