
Inference Quantization

Quantization is the technique of converting model weights and/or activations from high precision (FP32/FP16) to low precision (INT8/INT4) representations. In the context of inference, the primary goals of quantization are to reduce memory footprint and accelerate inference, enabling large models to run on limited hardware.

For training-time quantization techniques (PTQ, QAT), see Model Compression. This document focuses on quantization methods and formats commonly used in LLM inference deployment.


Fundamental Concepts

Quantization Mapping

At its core, quantization maps continuous floating-point values to discrete integer values:

\[ x_q = \text{round}\left(\frac{x}{\Delta}\right) + z, \quad \Delta = \frac{x_{\max} - x_{\min}}{2^b - 1} \]

where \(\Delta\) is the scale factor, \(z\) is the zero-point, and \(b\) is the quantization bit-width.
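As a concrete illustration, here is a minimal NumPy sketch of this affine mapping; the unsigned target range \([0, 2^b - 1]\), the clipping, and the function names are implementation choices, not part of the formula itself:

import numpy as np

def quantize(x, bits=8):
    # Affine (asymmetric) quantization following the formula above,
    # mapping to an unsigned integer range [0, 2^b - 1].
    qmin, qmax = 0, 2 ** bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)        # Δ
    zero_point = int(round(qmin - x.min() / scale))    # z
    x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)
    return x_q, scale, zero_point

def dequantize(x_q, scale, zero_point):
    return (x_q.astype(np.float32) - zero_point) * scale

x = np.random.randn(16).astype(np.float32)
x_q, scale, zp = quantize(x)
print(np.abs(x - dequantize(x_q, scale, zp)).max())    # error is at most ~Δ/2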

Symmetric vs. Asymmetric Quantization

  • Symmetric quantization: \(z = 0\), mapping the floating-point range symmetrically to \([-2^{b-1}, 2^{b-1}-1]\). Simple to implement and computationally efficient.
  • Asymmetric quantization: \(z \neq 0\), which better utilizes the quantization range and is suitable for activation distributions that are not symmetric.

Quantization Granularity

  • Per-tensor: A single set of \(\Delta, z\) is shared across the entire tensor. The simplest approach but yields the lowest accuracy.
  • Per-channel: Each output channel is quantized independently. This is the standard practice for weight quantization.
  • Per-group: Every \(g\) elements form a group (e.g., groups of 128). This strikes a balance between accuracy and overhead. GPTQ/AWQ commonly use \(g=128\).
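To make the granularity options concrete, the following sketch implements symmetric per-group quantization with \(g = 128\); the function name and the assumption that the input dimension divides evenly into groups are illustrative:

import numpy as np

def quantize_per_group(W, bits=4, group_size=128):
    # Symmetric per-group quantization: each row of W is split into groups
    # of `group_size` input elements; every group gets its own scale (z = 0).
    qmax = 2 ** (bits - 1) - 1
    out_features, in_features = W.shape            # assumes in_features % group_size == 0
    Wg = W.reshape(out_features, in_features // group_size, group_size)
    scales = np.abs(Wg).max(axis=-1, keepdims=True) / qmax   # one Δ per group
    W_q = np.clip(np.round(Wg / scales), -qmax - 1, qmax).astype(np.int8)
    return W_q.reshape(W.shape), scales.squeeze(-1)

W = np.random.randn(32, 256).astype(np.float32)
W_q, scales = quantize_per_group(W)
print(W_q.shape, scales.shape)    # (32, 256) and (32, 2): two groups of 128 per row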

Mainstream LLM Quantization Methods

GPTQ (GPT Quantization)

GPTQ is based on the Optimal Brain Quantization framework. It quantizes weights layer by layer while compensating for quantization error to preserve accuracy.

Core idea: When quantizing a given weight, the error it introduces is distributed across the remaining unquantized weights in the same row, minimizing the overall output error of the layer.

Procedure:

  1. Collect a small calibration dataset (typically 128 samples) and compute the Hessian matrix \(H = 2X^TX\) for each layer.
  2. Quantize weights column by column in sequential order.
  3. For each quantized weight, optimally redistribute the error to unquantized weights using Hessian information.
  4. Use Cholesky decomposition to accelerate the computation of the Hessian inverse.
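The error-compensation core (steps 2-3) can be sketched as follows. This is a simplified, single-row illustration; the actual algorithm processes columns in blocks, uses per-group scales, and obtains \(H^{-1}\) via a Cholesky factorization:

import numpy as np

def gptq_quantize_row(w, H_inv, bits=4):
    # w: one row of the weight matrix, shape (n,)
    # H_inv: inverse Hessian for this layer, shape (n, n)
    qmax = 2 ** (bits - 1) - 1
    w = w.copy()
    q = np.zeros_like(w)
    scale = np.abs(w).max() / qmax               # simplified per-row symmetric scale
    for j in range(len(w)):                      # quantize column by column (step 2)
        q[j] = np.clip(np.round(w[j] / scale), -qmax - 1, qmax)
        err = (w[j] - q[j] * scale) / H_inv[j, j]
        # Distribute the error onto the remaining unquantized weights (step 3).
        w[j + 1:] -= err * H_inv[j, j + 1:]
    return q, scale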

Characteristics:

  • Supports 4-bit / 3-bit / 2-bit quantization
  • Quantization is fast: a 7B-scale model typically finishes in minutes to tens of minutes on a single GPU
  • Minimal accuracy loss at 4-bit, close to the FP16 baseline
  • Strong ecosystem support: AutoGPTQ, native Transformers integration
# Quantize with AutoGPTQ
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,              # quantization bit-width
    group_size=128,      # per-group quantization, g = 128
    damp_percent=0.01    # Hessian dampening for numerical stability
)

model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantize_config=quantize_config
)
# calibration_dataset: a list of tokenized examples
# (dicts with "input_ids" and "attention_mask")
model.quantize(calibration_dataset)
model.save_quantized("llama2-7b-gptq-4bit")

AWQ (Activation-aware Weight Quantization)

The key observation behind AWQ is that not all weights are equally important — a small number of "salient weights" have a disproportionate impact on model output.

Core idea: By analyzing the distribution of activation values, AWQ identifies salient weight channels and applies scaling to protect them, concentrating quantization error on less important weights.

Procedure:

  1. Use calibration data to measure the activation magnitude for each weight channel.
  2. Channels with large activations correspond to "salient weights" and are multiplied by a scaling factor \(s > 1\).
  3. The corresponding activations are divided by \(s\) to maintain mathematical equivalence.
  4. The scaled weights have a larger numerical range, resulting in smaller relative quantization error.
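Steps 1-3 can be sketched as below. The exponent `alpha`, the normalization, and the function name are simplifications of AWQ's actual grid search over scaling factors:

import numpy as np

def awq_scale(W, act_magnitude, alpha=0.5):
    # W: (out_features, in_features) weight matrix
    # act_magnitude: (in_features,) average |activation| per input channel,
    #                measured on calibration data (step 1)
    s = act_magnitude ** alpha    # salient channels (large activations) get larger s (step 2)
    s = s / s.mean()              # normalize so the overall magnitude stays comparable
    W_scaled = W * s              # protect salient weight channels before quantization
    # Equivalence (step 3): y = W x = (W * s) @ (x / s), so the inverse scale 1/s
    # is folded into the preceding operation or applied to the activations at runtime.
    return W_scaled, s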

Advantages over GPTQ:

  • No inverse Hessian computation required, leading to faster quantization
  • Quantized models are more robust with better cross-task generalization
  • The advantage is more pronounced at lower bit-widths (3-bit)

GGUF (GPT-Generated Unified Format)

GGUF is the quantization format used in the llama.cpp ecosystem, specifically designed for CPU and edge-device inference.

Characteristics:

  • Supports multiple quantization types: Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, etc.
  • The "K-Quant" series uses per-block quantization with importance weighting, offering better quality than the earlier Q4_0/Q4_1 formats
  • Single-file format containing model weights, tokenizer, and metadata for easy distribution
  • Supports mixed precision: different layers can use different quantization bit-widths

Common quantization levels:

| Format | Bit-width | VRAM/RAM (7B) | Quality |
| --- | --- | --- | --- |
| Q2_K | ~2.6 bit | ~3 GB | Usable but with noticeable degradation |
| Q4_K_M | ~4.8 bit | ~4.5 GB | Recommended: good balance of quality and size |
| Q5_K_M | ~5.5 bit | ~5.3 GB | High quality, close to FP16 |
| Q6_K | ~6.6 bit | ~5.9 GB | Near lossless |
| Q8_0 | 8 bit | ~7.2 GB | Virtually lossless |
# Quantize with llama.cpp
./llama-quantize model-f16.gguf model-q4km.gguf Q4_K_M
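The resulting file can then be loaded directly from Python, for example via the llama-cpp-python bindings (assuming that package is installed and the file produced above is used):

from llama_cpp import Llama   # pip install llama-cpp-python

llm = Llama(model_path="model-q4km.gguf", n_ctx=2048)
out = llm("Q: What is quantization? A:", max_tokens=64)
print(out["choices"][0]["text"])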

FP8 Quantization

FP8 (8-bit floating point) is a data type natively supported by NVIDIA's Hopper architecture (H100) and comes in two formats:

  • E4M3 (4-bit exponent, 3-bit mantissa): Higher precision, suitable for weights and activations
  • E5M2 (5-bit exponent, 2-bit mantissa): Greater dynamic range, suitable for gradients

Advantages: Compared to INT8, FP8 weights can typically be quantized directly without calibration data, and the H100's FP8 Tensor Core throughput is 2x that of FP16. Both vLLM and TensorRT-LLM support FP8 inference.
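For example, with vLLM on an FP8-capable GPU, FP8 quantization can be requested at load time; the model name and prompt here are illustrative:

from vllm import LLM, SamplingParams

# quantization="fp8" quantizes the weights to FP8 (E4M3) when the model is loaded.
llm = LLM(model="meta-llama/Llama-2-7b-hf", quantization="fp8")
outputs = llm.generate(["Explain FP8 quantization in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)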


Comparison of Quantization Methods

| Method | Typical Bit-width | Calibration Data Required | Quantization Speed | Inference Framework | Use Case |
| --- | --- | --- | --- | --- | --- |
| GPTQ | 4/3/2 bit | Yes (~128 samples) | Medium | vLLM, TGI, Transformers | GPU deployment |
| AWQ | 4 bit | Yes (~128 samples) | Fast | vLLM, TGI, TensorRT-LLM | GPU deployment |
| GGUF | 2-8 bit | No | Fast | llama.cpp, Ollama | CPU/edge deployment |
| FP8 | 8 bit | No | Instant | vLLM, TensorRT-LLM | H100 GPU |
| BitsAndBytes | 4/8 bit | No | Instant | Transformers, TGI | Research/prototyping |

Selection guidelines:

  • High-concurrency GPU serving → AWQ or GPTQ 4-bit (deployed with vLLM)
  • NVIDIA H100 → FP8 (native support, negligible accuracy loss)
  • Local CPU/Mac → GGUF Q4_K_M (Ollama / llama.cpp)
  • Quick experimentation → BitsAndBytes NF4 (one-line loading)
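The BitsAndBytes route mentioned above really is close to one line; a minimal sketch with Transformers (the model name is illustrative):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NF4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)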

Evaluating Quantization Quality

When evaluating quantized models, the following metrics are typically considered:

  • Perplexity (PPL): Measured on datasets such as WikiText-2; lower PPL is better
  • Downstream task accuracy: Measured on benchmarks such as MMLU, HellaSwag, etc.
  • Inference speed: Throughput in tokens/s
  • Memory usage: Peak GPU memory consumption

As a rule of thumb, 4-bit quantization typically increases PPL by 0.1-0.5 (relative to FP16), which is imperceptible for most application scenarios.
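A back-of-the-envelope perplexity check can be done as follows. This simplified sketch evaluates fixed non-overlapping windows rather than the sliding-window protocol used for published WikiText-2 numbers, so it only gives a rough comparison between a quantized model and its FP16 baseline:

import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text, max_len=2048):
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    total_nll, total_tokens = 0.0, 0
    for start in range(0, input_ids.size(1), max_len):
        chunk = input_ids[:, start : start + max_len]
        if chunk.size(1) < 2:
            break
        loss = model(chunk, labels=chunk).loss   # mean NLL over chunk.size(1) - 1 tokens
        total_nll += loss.item() * (chunk.size(1) - 1)
        total_tokens += chunk.size(1) - 1
    return math.exp(total_nll / total_tokens)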


References

  • Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers", ICLR 2023
  • Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration", MLSys 2024
  • Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs", NeurIPS 2023
  • llama.cpp - GGUF format documentation
