Efficient Inference
Overview
Inference cost is the core deployment challenge for large models. This chapter systematically covers quantization, pruning, speculative decoding, KV cache optimization, and other key inference acceleration techniques.
1. Quantization
1.1 Basic Concepts
Convert high-precision (FP16/BF16) weights and activations to low-precision (INT8/INT4) representations.
Uniform Quantization:
\[ q = \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{x}{s}\right) + z,\; q_{\min},\; q_{\max}\right), \qquad \hat{x} = s \cdot (q - z) \]
where \(s\) is the scale factor and \(z\) is the zero point.
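A minimal NumPy sketch of asymmetric uniform quantization and dequantization (per-tensor, with the scale and zero point derived from the tensor's min/max; illustrative only):

```python
import numpy as np

def quantize(x, num_bits=8):
    """Uniform (asymmetric) quantization: x -> integer codes plus (scale, zero_point)."""
    qmin, qmax = 0, 2**num_bits - 1
    s = (x.max() - x.min()) / (qmax - qmin)        # scale factor
    z = qmin - round(float(x.min()) / s)           # zero point
    q = np.clip(np.round(x / s) + z, qmin, qmax)
    return q.astype(np.uint8), s, z

def dequantize(q, s, z):
    """Recover an approximation of the original tensor."""
    return s * (q.astype(np.float32) - z)

w = np.random.randn(4, 8).astype(np.float32)
q, s, z = quantize(w)
print("max abs reconstruction error:", np.abs(w - dequantize(q, s, z)).max())
```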
1.2 Post-Training Quantization (PTQ)
Directly quantize pretrained models without retraining.
GPTQ (Frantar et al., 2023):
- Based on OBS (Optimal Brain Surgeon) framework
- Quantizes weight matrix column by column
- Uses Hessian information to compensate for quantization error
- Supports INT4/INT3 quantization
- Can quantize 175B models on a single GPU
AWQ (Activation-Aware Weight Quantization, Lin et al., 2024):
- Observation: roughly 1% of weights are "salient" (the channels corresponding to large activation magnitudes) and critically affect model quality
- Rather than keeping them in higher precision, AWQ applies an equivalent per-channel scaling to reduce their quantization error
- Salient weight channels are multiplied by a scale factor \(s\) (with the matching activations divided by \(s\)), making them friendlier to quantize; see the sketch below
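The effect can be illustrated numerically: scaling a salient weight channel up by \(s\) while dividing the matching activation channel by \(s\) leaves \(XW\) unchanged in full precision, but shrinks that channel's contribution to the quantization error. A toy NumPy sketch (not AWQ's implementation; the per-channel grid search for \(s\) is omitted and the data is synthetic):

```python
import numpy as np

def fake_quant(w, num_bits=4):
    """Round-to-nearest per-tensor quantization, used only to measure the error it introduces."""
    step = np.abs(w).max() / (2**(num_bits - 1) - 1)
    return np.round(w / step) * step

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 16)).astype(np.float32)
X[:, 3] *= 30.0                       # channel 3 carries unusually large activations ("salient")
W = (rng.standard_normal((16, 32)) * 0.02).astype(np.float32)
W[0, 0] = 0.3                         # an unrelated large weight fixes the quantization range

s = 4.0                               # AWQ searches this per channel; fixed here for illustration
W_s, X_s = W.copy(), X.copy()
W_s[3, :] *= s                        # scale the salient channel's weights up ...
X_s[:, 3] /= s                        # ... and its activations down: X_s @ W_s == X @ W

ref = X @ W
err_plain  = np.abs(ref - X   @ fake_quant(W)).mean()
err_scaled = np.abs(ref - X_s @ fake_quant(W_s)).mean()
print(f"quantization error: {err_plain:.4f} (plain) vs {err_scaled:.4f} (scaled)")
```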
1.3 INT8 Quantization
LLM.int8() (Dettmers et al., 2022):
- Discovered that a small number of "outlier" feature channels carry very large activation values
- Mixed precision: outlier channels are computed in FP16, the rest in INT8 (see the sketch below)
- Roughly halves inference memory with essentially no quality loss
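A simplified sketch of the mixed-precision decomposition (per-tensor absmax scaling for brevity, where the paper uses vector-wise scaling; threshold and shapes are illustrative):

```python
import numpy as np

def int8_quant(x):
    """Simulated per-tensor absmax INT8 quantization."""
    s = np.abs(x).max() / 127.0
    return np.clip(np.round(x / s), -127, 127).astype(np.int8), s

def mixed_precision_matmul(X, W, threshold=6.0):
    """Outlier feature columns are multiplied in FP16, everything else in (simulated) INT8."""
    outlier = np.abs(X).max(axis=0) > threshold            # feature dims with large activations
    Xq, sx = int8_quant(X[:, ~outlier])
    Wq, sw = int8_quant(W[~outlier, :])
    y_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)
    y_fp16 = X[:, outlier].astype(np.float16) @ W[outlier, :].astype(np.float16)
    return y_int8 + y_fp16.astype(np.float32)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 64)).astype(np.float32)
X[:, 5] *= 20.0                                            # one outlier feature channel
W = rng.standard_normal((64, 32)).astype(np.float32)
print("max abs error vs FP32:", np.abs(X @ W - mixed_precision_matmul(X, W)).max())
```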
1.4 INT4 Quantization
| Method | Strategy | Quality | Speed / Target |
|---|---|---|---|
| GPTQ | Weight quantization | Good | Fast |
| AWQ | Activation-aware | Better | Fast |
| GGML/GGUF | CPU-friendly | Good | CPU inference |
| QLoRA | 4-bit base + LoRA | Good | Train + inference |
| SqueezeLLM | Non-uniform quantization | Good | Medium |
1.5 Quantization Precision Comparison
| Precision | Bytes per Param | 7B Model Size | Quality Impact |
|---|---|---|---|
| FP16 | 2 | 14 GB | Baseline |
| INT8 | 1 | 7 GB | Minimal |
| INT4 | 0.5 | 3.5 GB | Small |
| INT3 | 0.375 | 2.6 GB | Moderate |
| INT2 | 0.25 | 1.75 GB | Notable |
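The sizes above follow from parameter count times bytes per parameter (weights only, ignoring KV cache and activation memory):

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9

for bits in (16, 8, 4, 3, 2):
    print(f"7B model at {bits}-bit: {weight_memory_gb(7e9, bits):.2f} GB")
```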
2. Pruning
2.1 SparseGPT
SparseGPT (Frantar & Alistarh, 2023): Post-training pruning for large models.
- Achieves 50% unstructured sparsity in one shot, without retraining
- Near-optimal pruning based on row-wise Hessian inverse approximation
- Almost no quality loss
2.2 Pruning Types
| Type | Description | Speedup | Representative |
|---|---|---|---|
| Unstructured | Zero out arbitrary positions | Requires sparse hardware | SparseGPT, Wanda |
| Structured | Remove entire rows/columns/heads | Direct speedup | LLM-Pruner |
| Semi-structured | 2:4 sparsity (2 zeros per 4 elements) | A100 hardware support | - |
2.3 Wanda
Wanda (Sun et al., 2023): Simple pruning based on weights and activations.
Pruning metric: \(S_{ij} = |W_{ij}| \cdot \|X_j\|_2\)
Considers both weight magnitude and corresponding activation norm, no Hessian computation needed.
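A minimal sketch of the Wanda criterion for a single linear layer, pruning the lowest-scoring weights within each output row (the calibration activations \(X\) are random here for illustration):

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Zero out the lowest-scoring weights per output row, using S_ij = |W_ij| * ||X_j||_2."""
    # W: (out_features, in_features); X: (n_tokens, in_features) calibration activations
    act_norm = np.linalg.norm(X, axis=0)            # ||X_j||_2 for each input feature j
    scores = np.abs(W) * act_norm                   # broadcast over output rows
    k = int(W.shape[1] * sparsity)                  # weights to remove per row
    idx = np.argsort(scores, axis=1)[:, :k]         # lowest-scoring columns in each row
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, idx, 0.0, axis=1)
    return W_pruned

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 256)).astype(np.float32)
X = rng.standard_normal((512, 256)).astype(np.float32)
W_sparse = wanda_prune(W, X, sparsity=0.5)
print("achieved sparsity:", (W_sparse == 0).mean())
```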
3. Speculative Decoding
3.1 Core Idea
Use a small model to quickly "draft" multiple tokens, then verify in parallel with the large model.
Pipeline:
- Draft phase: Small model (draft model) autoregressively generates \(\gamma\) tokens
- Verification phase: Large model (target model) evaluates all drafted tokens in parallel
- Accept/reject: Decide which tokens to accept based on probability ratios
Acceptance Probability: a drafted token \(x\) is accepted with probability \(\min\!\left(1, \frac{p(x)}{q(x)}\right)\), where \(p\) is the target model's distribution and \(q\) the draft model's; on rejection, a replacement token is sampled from the residual distribution \(\operatorname{norm}(\max(0, p - q))\).
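A minimal sketch of the verify/accept step over a toy vocabulary, following this acceptance rule (the draft and target distributions are given directly as arrays; in practice they are the two models' softmax outputs):

```python
import numpy as np

def speculative_accept(p_target, q_draft, drafted_tokens, rng):
    """Accept drafted tokens left to right; on the first rejection, resample from norm(max(0, p - q))."""
    accepted = []
    for t, token in enumerate(drafted_tokens):
        p, q = p_target[t], q_draft[t]
        if rng.random() < min(1.0, p[token] / q[token]):
            accepted.append(int(token))
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break                        # everything drafted after a rejection is discarded
    return accepted

rng = np.random.default_rng(0)
vocab, gamma = 8, 4
q_draft  = rng.dirichlet(np.ones(vocab), size=gamma)    # draft model distributions
p_target = rng.dirichlet(np.ones(vocab), size=gamma)    # target model distributions
drafted  = np.array([rng.choice(vocab, p=q) for q in q_draft])
print(speculative_accept(p_target, q_draft, drafted, rng))
```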
3.2 Performance Gains
- Typical speedup: 2-3x (without changing output distribution)
- Guarantees same output quality as the large model
- Higher acceptance rate when the small model better approximates the large model
3.3 Variants
| Variant | Description |
|---|---|
| Medusa | Add multiple prediction heads to the large model for self-speculation |
| Eagle | Drafts at the feature level: a lightweight head autoregresses on the target model's hidden features to propose tokens |
| Lookahead | Accelerate with n-gram cache |
| Self-Speculative | Use shallow layers as the draft model |
4. KV Cache Optimization
4.1 PagedAttention (vLLM)
Problem: conventional serving systems reserve contiguous KV cache memory per request, causing severe fragmentation and waste under concurrent requests.
PagedAttention (Kwon et al., 2023):
- Borrows the paging mechanism from OS virtual memory
- KV cache split into fixed-size "pages"
- Non-contiguous storage, allocated on demand
- Improves memory utilization from roughly 20-40% to over 95% (toy allocator sketch below)
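A toy sketch of the bookkeeping only (block tables and a shared free pool; no attention kernel and none of vLLM's internals):

```python
class PagedKVCacheAllocator:
    """Toy paged KV cache: each sequence maps to a list of fixed-size blocks from a shared pool."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block IDs
        self.block_tables = {}                       # seq_id -> [block_id, ...]
        self.seq_lens = {}                           # seq_id -> tokens stored so far

    def append_token(self, seq_id: int):
        """Reserve space for one more token's K/V, allocating a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:            # current block is full (or none exists yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or queue the request")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: int):
        """Return a finished sequence's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = PagedKVCacheAllocator(num_blocks=4, block_size=16)
for _ in range(40):       # 40 tokens -> ceil(40/16) = 3 blocks, no large upfront reservation
    alloc.append_token(seq_id=0)
print(alloc.block_tables[0], "free blocks left:", len(alloc.free_blocks))
```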
4.2 KV Cache Compression
| Method | Strategy | Compression |
|---|---|---|
| GQA/MQA | Reduce KV heads | 4-8x |
| Quantized KV | INT8/INT4 quantize KV cache | 2-4x |
| KV Eviction | Drop unimportant KV pairs | Variable |
| StreamingLLM | Keep attention-sink tokens + a sliding window | Cache bounded by window size |
| H2O | Importance-aware KV eviction | Variable |
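Per-sequence KV cache memory is roughly \(2 \times \text{layers} \times \text{KV heads} \times \text{head dim} \times \text{seq len} \times \text{bytes per element}\), so the GQA and quantized-KV rows translate directly into savings. A quick estimate with assumed 7B-class shapes (32 layers, head dim 128, 4K context):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV cache size per sequence in GB (factor 2 = keys and values)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

print(kv_cache_gb(32, kv_heads=32, head_dim=128, seq_len=4096))                    # full MHA, FP16
print(kv_cache_gb(32, kv_heads=8,  head_dim=128, seq_len=4096))                    # GQA: 4x smaller
print(kv_cache_gb(32, kv_heads=8,  head_dim=128, seq_len=4096, bytes_per_elem=1))  # GQA + INT8 KV
```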
4.3 Continuous Batching
- Don't wait for entire batch to complete; release resources as individual requests finish
- New requests join immediately
- Dramatically improves GPU utilization and throughput (see the simplified loop below)
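A highly simplified scheduling loop illustrating the idea; `generate_step` and `is_finished` are hypothetical stand-ins for a batched decode step and a stopping check:

```python
from collections import deque

def continuous_batching_loop(waiting: deque, generate_step, is_finished, max_batch=8):
    """Each iteration decodes one token for every running request, retires finished ones,
    and immediately admits new requests from the queue -- no waiting for a whole batch."""
    running, completed = [], []
    while running or waiting:
        while waiting and len(running) < max_batch:      # admit as soon as a slot frees up
            running.append(waiting.popleft())
        generate_step(running)                           # one decode step for the whole batch
        still_running = []
        for req in running:
            (completed if is_finished(req) else still_running).append(req)
        running = still_running
    return completed

# Toy usage: each "request" just needs a fixed number of decode steps
reqs = deque({"id": i, "remaining": 3 + i % 4} for i in range(20))
def step(batch):
    for r in batch:
        r["remaining"] -= 1
done = continuous_batching_loop(reqs, step, lambda r: r["remaining"] == 0)
print(len(done), "requests completed")
```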
5. Other Optimization Techniques
5.1 Operator Fusion
Merge multiple small operators into one large kernel:
- LayerNorm + Linear fusion
- QKV projection fusion within attention
- FlashAttention (fuses the attention matmuls and softmax into one tiled kernel, never materializing the full attention matrix)
5.2 Model Parallel Inference
- Tensor Parallelism: Shard each layer's weight matrices across multiple GPUs (see the sketch after this list)
- Pipeline Parallelism: Different layers on different GPUs
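The idea behind a column-parallel linear layer, with plain NumPy arrays standing in for per-GPU shards (no actual multi-GPU communication):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 512)).astype(np.float32)
W = rng.standard_normal((512, 2048)).astype(np.float32)

# Column parallelism: each "GPU" holds a slice of W's output columns and computes its part
num_gpus = 4
shards = np.split(W, num_gpus, axis=1)          # 4 shards of shape (512, 512)
partial = [X @ w for w in shards]               # each device computes its own output slice
Y = np.concatenate(partial, axis=1)             # all-gather along the output dimension

print(np.allclose(Y, X @ W, atol=1e-4))
```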
5.3 Compilation Optimization
| Tool | Description |
|---|---|
| TensorRT-LLM | NVIDIA's LLM inference optimization |
| vLLM | PagedAttention + Continuous Batching |
| TGI | HuggingFace's inference serving |
| SGLang | Programmable LLM inference |
| llama.cpp | CPU/hybrid inference |
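As a starting point, offline batched generation with vLLM looks roughly like this (a sketch based on vLLM's documented Python API; the model name and sampling settings are placeholders):

```python
from vllm import LLM, SamplingParams

# Model name and sampling settings are illustrative placeholders
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", dtype="float16")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```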
6. Hardware Comparison
| Hardware | Memory | FP16 TFLOPS | INT8 TOPS | Use Case |
|---|---|---|---|---|
| A100 80GB | 80 GB HBM2e | 312 | 624 | Training + inference |
| H100 80GB | 80 GB HBM3 | 990 | 1979 | High-performance inference |
| L40S | 48 GB GDDR6 | 362 | 724 | Inference-optimized |
| A10G | 24 GB GDDR6 | 125 | 250 | Medium inference |
| Apple M2 Ultra | 192 GB unified | ~27 | - | Local inference |
7. Summary
| Technique | Speedup | Quality Impact | Implementation Complexity |
|---|---|---|---|
| INT8 Quantization | 2x | Minimal | Low |
| INT4 Quantization | 3-4x | Small | Medium |
| Speculative Decoding | 2-3x | Zero | Medium |
| PagedAttention | 2-4x throughput | Zero | Low (use vLLM) |
| Sparse Pruning | 1.5-2x | Small | Medium |
| Operator Fusion | 1.5-2x | Zero | High |
Practical Recommendations:
- Start with mature frameworks like vLLM/TGI
- INT8 quantization is the lowest-cost optimization
- GQA models (LLaMA 2+) are more inference-friendly
- Speculative decoding suits latency-sensitive scenarios
- Combining multiple techniques yields the best results
References
- Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers," ICLR 2023
- Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration," MLSys 2024
- Frantar & Alistarh, "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot," ICML 2023
- Leviathan et al., "Fast Inference from Transformers via Speculative Decoding," ICML 2023
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention," SOSP 2023