
Efficient Inference

Overview

Inference cost is the core deployment challenge for large models. This chapter systematically covers quantization, pruning, speculative decoding, KV cache optimization, and other key inference acceleration techniques.


1. Quantization

1.1 Basic Concepts

Quantization converts high-precision (FP16/BF16) weights and activations to low-precision (INT8/INT4) representations.

Uniform Quantization:

\[ x_q = \text{round}\left(\frac{x}{s}\right) + z \]
\[ \hat{x} = s \cdot (x_q - z) \]

where \(s\) is the scale factor and \(z\) is the zero point.
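Below is a minimal NumPy sketch of these two formulas, assuming per-tensor asymmetric 8-bit quantization; the bit width, rounding scheme, and random tensor are illustrative choices, not any particular library's implementation.

```python
import numpy as np

def quantize(x, n_bits=8):
    """Asymmetric uniform quantization: x_q = round(x / s) + z."""
    qmin, qmax = 0, 2**n_bits - 1
    s = (x.max() - x.min()) / (qmax - qmin)          # scale factor
    z = np.round(-x.min() / s)                       # zero point (maps x.min() to qmin)
    x_q = np.clip(np.round(x / s) + z, qmin, qmax)
    return x_q.astype(np.uint8), s, z

def dequantize(x_q, s, z):
    """Dequantization: x_hat = s * (x_q - z)."""
    return s * (x_q.astype(np.float32) - z)

x = np.random.randn(1024).astype(np.float32)
x_q, s, z = quantize(x)
print("max abs reconstruction error:", np.abs(x - dequantize(x_q, s, z)).max())
```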

1.2 Post-Training Quantization (PTQ)

PTQ quantizes a pretrained model directly, without any retraining.

GPTQ (Frantar et al., 2023):

  • Based on OBS (Optimal Brain Surgeon) framework
  • Quantizes weight matrix column by column
  • Uses Hessian information to compensate for quantization error
  • Supports INT4/INT3 quantization
  • Can quantize 175B models on a single GPU

AWQ (Activation-Aware Weight Quantization, Lin et al., 2024):

  • Observation: 1% of "salient weights" (channels corresponding to large activations) critically affect model quality
  • Rather than keeping them in high precision, uses equivalent scaling to reduce quantization error
  • Multiplies salient weight channels by a per-channel scale factor \(s\) and divides the corresponding activation channels by \(s\), which reduces their relative quantization error (a toy numeric demonstration follows the formula):
\[ Q(w \cdot s) \cdot (x / s) \approx w \cdot x \]
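The toy example below illustrates this scaling identity with simple round-to-nearest symmetric INT4 weight quantization. The weights, activations, and the scale value are hand-picked to make the effect visible; AWQ instead searches for per-channel scales on calibration data.

```python
import numpy as np

def fake_quant(w, n_bits=4):
    """Round-to-nearest symmetric quantization with a per-tensor scale."""
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(w).max() / qmax
    return np.round(w / step) * step

w = np.linspace(-0.7, 0.7, 15)       # weights; per-tensor INT4 step = 0.7 / 7 = 0.1
w[3] = 0.05                          # weight of the "salient" channel
x = np.ones(15)
x[3] = 50.0                          # this channel carries a large activation

exact  = w @ x
plain  = fake_quant(w) @ x                        # quantize w directly
s = np.ones_like(w); s[3] = 2.0                   # scale up only the salient channel
scaled = fake_quant(w * s) @ (x / s)              # Q(w*s) @ (x/s)

print("error, plain quantization:", abs(plain - exact))
print("error, AWQ-style scaling: ", abs(scaled - exact))
```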

1.3 INT8 Quantization

LLM.int8() (Dettmers et al., 2022):

  • Discovered that a few "outlier" feature channels have very large values
  • Mixed precision: outlier channels use FP16, rest use INT8
  • Roughly halves inference memory with near-zero quality loss (a minimal sketch of the decomposition follows below)
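Here is a minimal NumPy sketch of the mixed-precision decomposition, assuming a simplified per-tensor INT8 scheme and an outlier threshold of 6.0; the actual method uses vector-wise scales and runs in GPU kernels.

```python
import numpy as np

def int8_matmul(x, w):
    """Symmetric per-tensor INT8 matmul (the paper uses vector-wise scales)."""
    sx = np.abs(x).max() / 127.0
    sw = np.abs(w).max() / 127.0
    xq = np.round(x / sx).astype(np.int8)
    wq = np.round(w / sw).astype(np.int8)
    return (xq.astype(np.int32) @ wq.astype(np.int32)) * (sx * sw)

def mixed_matmul(x, w, threshold=6.0):
    """Outlier feature dimensions stay in full precision; the rest go through INT8."""
    outlier = np.abs(x).max(axis=0) > threshold
    y = x[:, outlier] @ w[outlier, :]                 # high-precision path
    y += int8_matmul(x[:, ~outlier], w[~outlier, :])  # INT8 path
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))
x[:, 7] *= 40.0                                       # inject one outlier channel
w = rng.normal(size=(64, 32))
print("max abs error vs full precision:", np.abs(mixed_matmul(x, w) - x @ w).max())
```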

1.4 INT4 Quantization

| Method | Strategy | Quality | Speed |
| --- | --- | --- | --- |
| GPTQ | Weight quantization | Good | Fast |
| AWQ | Activation-aware | Better | Fast |
| GGML/GGUF | CPU-friendly | Good | CPU inference |
| QLoRA | 4-bit base + LoRA | Good | Train + inference |
| SqueezeLLM | Non-uniform quantization | Good | Medium |

1.5 Quantization Precision Comparison

| Precision | Bytes per Param | 7B Model Size | Quality Impact |
| --- | --- | --- | --- |
| FP16 | 2 | 14 GB | Baseline |
| INT8 | 1 | 7 GB | Minimal |
| INT4 | 0.5 | 3.5 GB | Small |
| INT3 | 0.375 | 2.6 GB | Moderate |
| INT2 | 0.25 | 1.75 GB | Notable |
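The model sizes in the table follow directly from bytes-per-parameter arithmetic; the short calculation below reproduces them (weights only; the KV cache and activations add on top).

```python
def weight_memory_gb(n_params, bits_per_param):
    """Weight storage in GB: params * bits / 8 bits-per-byte / 1e9 bytes-per-GB."""
    return n_params * bits_per_param / 8 / 1e9

for bits in (16, 8, 4, 3, 2):
    print(f"{bits:>2}-bit 7B model: {weight_memory_gb(7e9, bits):.2f} GB")
```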

2. Pruning

2.1 SparseGPT

SparseGPT (Frantar & Alistarh, 2023): Post-training pruning for large models.

  • Achieves 50% unstructured sparsity in one shot, without any retraining
  • Near-optimal pruning based on row-wise Hessian inverse approximation
  • Almost no quality loss

2.2 Pruning Types

| Type | Description | Speedup | Representative |
| --- | --- | --- | --- |
| Unstructured | Zero out arbitrary positions | Requires sparse hardware | SparseGPT, Wanda |
| Structured | Remove entire rows/columns/heads | Direct speedup | LLM-Pruner |
| Semi-structured | 2:4 sparsity (2 zeros per 4 elements) | A100 hardware support | - |

2.3 Wanda

Wanda (Sun et al., 2023): Simple pruning based on weights and activations.

Pruning metric: \(S_{ij} = |W_{ij}| \cdot \|X_j\|_2\)

Considers both weight magnitude and corresponding activation norm, no Hessian computation needed.
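A minimal NumPy sketch of the Wanda score and per-row 50% pruning follows; the weight shape and calibration batch are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 512))                 # weights (out_features, in_features)
X = rng.normal(size=(256, 512))                 # calibration activations (tokens, in_features)

score = np.abs(W) * np.linalg.norm(X, axis=0)   # S_ij = |W_ij| * ||X_j||_2
k = W.shape[1] // 2                             # prune 50% of weights per output row
idx = np.argsort(score, axis=1)[:, :k]          # lowest-scoring weights in each row
W_pruned = W.copy()
np.put_along_axis(W_pruned, idx, 0.0, axis=1)
print("achieved sparsity:", (W_pruned == 0).mean())
```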


3. Speculative Decoding

3.1 Core Idea

Use a small model to quickly "draft" multiple tokens, then verify in parallel with the large model.

Pipeline:

  1. Draft phase: Small model (draft model) autoregressively generates \(\gamma\) tokens
  2. Verification phase: Large model (target model) evaluates all drafted tokens in parallel
  3. Accept/reject: Decide which tokens to accept based on probability ratios

Acceptance Probability:

\[ p_{\text{accept}} = \min\left(1, \frac{p_{\text{target}}(x_t | x_{<t})}{p_{\text{draft}}(x_t | x_{<t})}\right) \]
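Below is a minimal sketch of the verification loop using this acceptance rule. The `draft_probs` and `target_probs` arrays stand in for real model outputs, and the residual resampling on rejection follows the scheme of Leviathan et al.; the bonus token sampled when every draft is accepted is omitted for brevity.

```python
import numpy as np

def speculative_step(draft_tokens, draft_probs, target_probs, rng):
    """Verify gamma drafted tokens; return the accepted prefix plus one correction."""
    out = []
    for t, tok in enumerate(draft_tokens):
        p, q = target_probs[t, tok], draft_probs[t, tok]
        if rng.random() < min(1.0, p / q):        # accept with prob min(1, p_target/p_draft)
            out.append(tok)
        else:
            residual = np.maximum(target_probs[t] - draft_probs[t], 0.0)
            residual /= residual.sum()            # resample from the residual distribution
            out.append(int(rng.choice(len(residual), p=residual)))
            break                                 # remaining draft tokens are discarded
    return out

rng = np.random.default_rng(0)
gamma, vocab = 4, 10
draft_probs  = rng.dirichlet(np.ones(vocab), size=gamma)   # stand-ins for model outputs
target_probs = rng.dirichlet(np.ones(vocab), size=gamma)
draft_tokens = [int(p.argmax()) for p in draft_probs]      # greedy draft for simplicity
print("accepted tokens:", speculative_step(draft_tokens, draft_probs, target_probs, rng))
```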

3.2 Performance Gains

  • Typical speedup: 2-3x
  • The accept/reject rule provably preserves the target model's output distribution, so output quality is identical to running the large model alone
  • The better the draft model approximates the target model, the higher the acceptance rate and the larger the speedup

3.3 Variants

| Variant | Description |
| --- | --- |
| Medusa | Add multiple prediction heads to the large model for self-speculation |
| Eagle | Use previous-layer features to predict the next token |
| Lookahead | Accelerate with an n-gram cache |
| Self-Speculative | Use shallow layers as the draft model |

4. KV Cache Optimization

4.1 PagedAttention (vLLM)

Problem: KV cache requires contiguous memory; severe fragmentation with concurrent requests.

PagedAttention (Kwon et al., 2023):

  • Borrows the paging mechanism from OS virtual memory
  • KV cache split into fixed-size "pages"
  • Non-contiguous storage, allocated on demand
  • Raises KV-cache memory utilization from roughly 20-40% to over 95% (a toy block-table sketch follows this list)
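The sketch below captures only the bookkeeping behind the paging idea: a pool of fixed-size physical blocks plus a per-sequence block table. The block and pool sizes are illustrative, and a real implementation (e.g., vLLM) manages GPU tensors and per-layer/per-head slots rather than Python lists.

```python
class PagedKVCache:
    """Toy block-table bookkeeping: logical token positions -> physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # pool of physical blocks
        self.block_tables = {}                       # seq_id -> list of physical block ids
        self.seq_lens = {}                           # seq_id -> tokens cached so far

    def append_token(self, seq_id):
        """Reserve one KV slot; grab a new block only when the current one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1], n % self.block_size        # (physical block, slot in block)

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):                                  # 20 tokens -> exactly 2 blocks used
    cache.append_token("req-0")
print("block table:", cache.block_tables["req-0"])
cache.release("req-0")
```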

4.2 KV Cache Compression

| Method | Strategy | Compression |
| --- | --- | --- |
| GQA/MQA | Reduce KV heads | 4-8x |
| Quantized KV | INT8/INT4 quantize KV cache | 2-4x |
| KV Eviction | Drop unimportant KV pairs | Variable |
| StreamingLLM | Keep attention sink + sliding window | Large |
| H2O | Importance-aware KV eviction | Variable |

4.3 Continuous Batching

  • Don't wait for entire batch to complete; release resources as individual requests finish
  • New requests join immediately
  • Dramatically improves GPU utilization and throughput (see the toy scheduling loop below)
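A toy scheduling loop illustrating the idea; the `decode_step` function is a stand-in for one decode iteration of a real engine, and request completion is simulated randomly.

```python
import random
from collections import deque

def decode_step(batch):
    """Stand-in for one decode iteration: each sequence finishes with 10% probability."""
    return [seq for seq in batch if random.random() > 0.1]

waiting = deque(f"req-{i}" for i in range(32))
running, max_batch_size = [], 8

steps = 0
while waiting or running:
    while waiting and len(running) < max_batch_size:   # admit new requests immediately,
        running.append(waiting.popleft())              # without waiting for the batch to drain
    running = decode_step(running)
    steps += 1
print("decode steps used:", steps)
```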

5. Other Optimization Techniques

5.1 Operator Fusion

Merge multiple small operators into one large kernel:

  • LayerNorm + Linear fusion
  • QKV projection fusion within attention (a minimal example follows this list)
  • FlashAttention (fuses the attention-score matmul, softmax, and value matmul into a single tiled kernel)
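A minimal PyTorch illustration of QKV-projection fusion: three separate projections are replaced by one linear layer whose output is split, so a single GEMM is launched instead of three. Dimensions are illustrative, and the weight copy is only there to show the two forms compute the same result.

```python
import torch
import torch.nn as nn

d_model = 512
x = torch.randn(4, 128, d_model)                     # (batch, seq_len, d_model)

# Unfused: three separate GEMM launches
q_proj, k_proj, v_proj = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
q, k, v = q_proj(x), k_proj(x), v_proj(x)

# Fused: one GEMM, then a cheap split
qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
with torch.no_grad():                                # reuse the same weights for comparison
    qkv_proj.weight.copy_(torch.cat([q_proj.weight, k_proj.weight, v_proj.weight], dim=0))
q2, k2, v2 = qkv_proj(x).chunk(3, dim=-1)

print(torch.allclose(q, q2, atol=1e-5),
      torch.allclose(k, k2, atol=1e-5),
      torch.allclose(v, v2, atol=1e-5))
```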

5.2 Model Parallel Inference

  • Tensor Parallelism: Shard single layer across multiple GPUs
  • Pipeline Parallelism: Different layers on different GPUs

5.3 Inference Frameworks and Compilation

| Tool | Description |
| --- | --- |
| TensorRT-LLM | NVIDIA's LLM inference optimization |
| vLLM | PagedAttention + continuous batching |
| TGI | Hugging Face's inference server (Text Generation Inference) |
| SGLang | Programmable LLM inference |
| llama.cpp | CPU/hybrid inference |

6. Hardware Comparison

| Hardware | Memory | FP16 TFLOPS | INT8 TOPS | Use Case |
| --- | --- | --- | --- | --- |
| A100 80GB | 80 GB HBM2e | 312 | 624 | Training + inference |
| H100 80GB | 80 GB HBM3 | 990 | 1979 | High-performance inference |
| L40S | 48 GB GDDR6 | 362 | 724 | Inference-optimized |
| A10G | 24 GB GDDR6 | 125 | 250 | Medium-scale inference |
| Apple M2 Ultra | 192 GB unified | ~27 | - | Local inference |

7. Summary

| Technique | Speedup | Quality Impact | Implementation Complexity |
| --- | --- | --- | --- |
| INT8 Quantization | 2x | Minimal | Low |
| INT4 Quantization | 3-4x | Small | Medium |
| Speculative Decoding | 2-3x | Zero | Medium |
| PagedAttention | 2-4x throughput | Zero | Low (use vLLM) |
| Sparse Pruning | 1.5-2x | Small | Medium |
| Operator Fusion | 1.5-2x | Zero | High |

Practical Recommendations:

  1. Start with mature frameworks like vLLM or TGI (a minimal vLLM example is sketched below)
  2. INT8 quantization is the lowest-cost optimization
  3. GQA models (LLaMA 2+) are more inference-friendly
  4. Speculative decoding suits latency-sensitive scenarios
  5. Combining multiple techniques yields the best results
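A minimal vLLM example tying several of the techniques above together: PagedAttention and continuous batching are built into the engine, and AWQ INT4 weights are loaded via the quantization argument. The model name is illustrative, and argument defaults may vary across vLLM versions.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",     # illustrative AWQ-quantized checkpoint
    quantization="awq",                   # load INT4 AWQ weights
    gpu_memory_utilization=0.90,          # leave headroom for activations
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain speculative decoding in one sentence."], params)
print(outputs[0].outputs[0].text)
```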

References

  • Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers," ICLR 2023
  • Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration," MLSys 2024
  • Frantar & Alistarh, "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot," ICML 2023
  • Leviathan et al., "Fast Inference from Transformers via Speculative Decoding," ICML 2023
  • Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention," SOSP 2023
