Efficient Inference
Overview
Inference cost is the core deployment challenge for large models. This chapter systematically covers quantization, pruning, speculative decoding, KV cache optimization, and other key inference acceleration techniques.
1. Quantization
1.1 Basic Concepts
Convert high-precision (FP16/BF16) weights and activations to low-precision (INT8/INT4) representations.
Uniform Quantization:
\[ q = \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{x}{s}\right) + z,\; q_{\min},\; q_{\max}\right), \qquad \hat{x} = s \cdot (q - z) \]
where \(s\) is the scale factor and \(z\) is the zero point.
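A minimal NumPy sketch of asymmetric uniform quantization and dequantization (per-tensor, with the scale and zero point derived from the tensor's min/max; illustrative only):

```python
import numpy as np

def quantize(x, num_bits=8):
    """Uniform (asymmetric) quantization: x -> integer codes plus (scale, zero_point)."""
    qmin, qmax = 0, 2**num_bits - 1
    s = (x.max() - x.min()) / (qmax - qmin)        # scale factor
    z = qmin - round(float(x.min()) / s)           # zero point
    q = np.clip(np.round(x / s) + z, qmin, qmax)
    return q.astype(np.uint8), s, z

def dequantize(q, s, z):
    """Recover an approximation of the original tensor."""
    return s * (q.astype(np.float32) - z)

w = np.random.randn(4, 8).astype(np.float32)
q, s, z = quantize(w)
print("max abs reconstruction error:", np.abs(w - dequantize(q, s, z)).max())
```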
1.2 Post-Training Quantization (PTQ)
Directly quantize pretrained models without retraining.
GPTQ (Frantar et al., 2023):
- Based on OBS (Optimal Brain Surgeon) framework
- Quantizes weight matrix column by column
- Uses Hessian information to compensate for quantization error
- Supports INT4/INT3 quantization
- Can quantize 175B models on a single GPU
AWQ (Activation-Aware Weight Quantization, Lin et al., 2024):
- Observation: roughly 1% of weights are "salient" (the channels corresponding to large activation magnitudes) and critically affect model quality
- Rather than keeping them in higher precision, AWQ applies an equivalent per-channel scaling to reduce their quantization error
- Salient weight channels are multiplied by a scale factor \(s\) (with the matching activations divided by \(s\)), making them friendlier to quantize; see the sketch below
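The effect can be illustrated numerically: scaling a salient weight channel up by \(s\) while dividing the matching activation channel by \(s\) leaves \(XW\) unchanged in full precision, but shrinks that channel's contribution to the quantization error. A toy NumPy sketch (not AWQ's implementation; the per-channel grid search for \(s\) is omitted and the data is synthetic):

```python
import numpy as np

def fake_quant(w, num_bits=4):
    """Round-to-nearest per-tensor quantization, used only to measure the error it introduces."""
    step = np.abs(w).max() / (2**(num_bits - 1) - 1)
    return np.round(w / step) * step

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 16)).astype(np.float32)
X[:, 3] *= 30.0                       # channel 3 carries unusually large activations ("salient")
W = (rng.standard_normal((16, 32)) * 0.02).astype(np.float32)
W[0, 0] = 0.3                         # an unrelated large weight fixes the quantization range

s = 4.0                               # AWQ searches this per channel; fixed here for illustration
W_s, X_s = W.copy(), X.copy()
W_s[3, :] *= s                        # scale the salient channel's weights up ...
X_s[:, 3] /= s                        # ... and its activations down: X_s @ W_s == X @ W

ref = X @ W
err_plain  = np.abs(ref - X   @ fake_quant(W)).mean()
err_scaled = np.abs(ref - X_s @ fake_quant(W_s)).mean()
print(f"quantization error: {err_plain:.4f} (plain) vs {err_scaled:.4f} (scaled)")
```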
1.3 INT8 Quantization
LLM.int8() (Dettmers et al., 2022):
- Discovered that a small number of "outlier" feature channels carry very large activation values
- Mixed precision: outlier channels are computed in FP16, the rest in INT8 (see the sketch below)
- Roughly halves inference memory with essentially no quality loss
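A simplified sketch of the mixed-precision decomposition (per-tensor absmax scaling for brevity, where the paper uses vector-wise scaling; threshold and shapes are illustrative):

```python
import numpy as np

def int8_quant(x):
    """Simulated per-tensor absmax INT8 quantization."""
    s = np.abs(x).max() / 127.0
    return np.clip(np.round(x / s), -127, 127).astype(np.int8), s

def mixed_precision_matmul(X, W, threshold=6.0):
    """Outlier feature columns are multiplied in FP16, everything else in (simulated) INT8."""
    outlier = np.abs(X).max(axis=0) > threshold            # feature dims with large activations
    Xq, sx = int8_quant(X[:, ~outlier])
    Wq, sw = int8_quant(W[~outlier, :])
    y_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)
    y_fp16 = X[:, outlier].astype(np.float16) @ W[outlier, :].astype(np.float16)
    return y_int8 + y_fp16.astype(np.float32)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 64)).astype(np.float32)
X[:, 5] *= 20.0                                            # one outlier feature channel
W = rng.standard_normal((64, 32)).astype(np.float32)
print("max abs error vs FP32:", np.abs(X @ W - mixed_precision_matmul(X, W)).max())
```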
1.4 INT4 Quantization
| Method | Strategy | Quality | Speed / Target |
|---|---|---|---|
| GPTQ | Weight quantization | Good | Fast |
| AWQ | Activation-aware | Better | Fast |
| GGML/GGUF | CPU-friendly | Good | CPU inference |
| QLoRA | 4-bit base + LoRA | Good | Train + inference |
| SqueezeLLM | Non-uniform quantization | Good | Medium |
1.5 Quantization Precision Comparison
| Precision | Bytes per Param | 7B Model Size | Quality Impact |
|---|---|---|---|
| FP16 | 2 | 14 GB | Baseline |
| INT8 | 1 | 7 GB | Minimal |
| INT4 | 0.5 | 3.5 GB | Small |
| INT3 | 0.375 | 2.6 GB | Moderate |
| INT2 | 0.25 | 1.75 GB | Notable |
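The sizes above follow from parameter count times bytes per parameter (weights only, ignoring KV cache and activation memory):

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9

for bits in (16, 8, 4, 3, 2):
    print(f"7B model at {bits}-bit: {weight_memory_gb(7e9, bits):.2f} GB")
```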
2. Pruning
2.1 SparseGPT
SparseGPT (Frantar & Alistarh, 2023): Post-training pruning for large models.
- Achieves 50% unstructured sparsity in one shot, without retraining
- Near-optimal pruning based on row-wise Hessian inverse approximation
- Almost no quality loss
2.2 Pruning Types
| Type | Description | Speedup | Representative |
|---|---|---|---|
| Unstructured | Zero out arbitrary positions | Requires sparse hardware | SparseGPT, Wanda |
| Structured | Remove entire rows/columns/heads | Direct speedup | LLM-Pruner |
| Semi-structured | 2:4 sparsity (2 zeros per 4 elements) | A100 hardware support | - |
2.3 Wanda
Wanda (Sun et al., 2023): Simple pruning based on weights and activations.
Pruning metric: \(S_{ij} = |W_{ij}| \cdot \|X_j\|_2\)
Considers both weight magnitude and corresponding activation norm, no Hessian computation needed.
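A minimal sketch of the Wanda criterion for a single linear layer, pruning the lowest-scoring weights within each output row (the calibration activations \(X\) are random here for illustration):

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Zero out the lowest-scoring weights per output row, using S_ij = |W_ij| * ||X_j||_2."""
    # W: (out_features, in_features); X: (n_tokens, in_features) calibration activations
    act_norm = np.linalg.norm(X, axis=0)            # ||X_j||_2 for each input feature j
    scores = np.abs(W) * act_norm                   # broadcast over output rows
    k = int(W.shape[1] * sparsity)                  # weights to remove per row
    idx = np.argsort(scores, axis=1)[:, :k]         # lowest-scoring columns in each row
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, idx, 0.0, axis=1)
    return W_pruned

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 256)).astype(np.float32)
X = rng.standard_normal((512, 256)).astype(np.float32)
W_sparse = wanda_prune(W, X, sparsity=0.5)
print("achieved sparsity:", (W_sparse == 0).mean())
```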
3. Speculative Decoding
3.1 Core Idea
Use a small model to quickly "draft" multiple tokens, then verify in parallel with the large model.
Pipeline:
- Draft phase: Small model (draft model) autoregressively generates \(\gamma\) tokens
- Verification phase: Large model (target model) evaluates all drafted tokens in parallel
- Accept/reject: Decide which tokens to accept based on probability ratios
Acceptance Probability: a drafted token \(x\) is accepted with probability \(\min\!\left(1, \frac{p(x)}{q(x)}\right)\), where \(p\) is the target model's distribution and \(q\) the draft model's; on rejection, a replacement token is sampled from the residual distribution \(\operatorname{norm}(\max(0, p - q))\).
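A minimal sketch of the verify/accept step over a toy vocabulary, following this acceptance rule (the draft and target distributions are given directly as arrays; in practice they are the two models' softmax outputs):

```python
import numpy as np

def speculative_accept(p_target, q_draft, drafted_tokens, rng):
    """Accept drafted tokens left to right; on the first rejection, resample from norm(max(0, p - q))."""
    accepted = []
    for t, token in enumerate(drafted_tokens):
        p, q = p_target[t], q_draft[t]
        if rng.random() < min(1.0, p[token] / q[token]):
            accepted.append(int(token))
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break                        # everything drafted after a rejection is discarded
    return accepted

rng = np.random.default_rng(0)
vocab, gamma = 8, 4
q_draft  = rng.dirichlet(np.ones(vocab), size=gamma)    # draft model distributions
p_target = rng.dirichlet(np.ones(vocab), size=gamma)    # target model distributions
drafted  = np.array([rng.choice(vocab, p=q) for q in q_draft])
print(speculative_accept(p_target, q_draft, drafted, rng))
```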
3.2 Performance Gains
- Typical speedup: 2-3x (without changing output distribution)
- Guarantees same output quality as the large model
- Higher acceptance rate when the small model better approximates the large model
3.3 Variants
| Variant | Description |
|---|---|
| Medusa | Add multiple prediction heads to the large model for self-speculation |
| Eagle | Drafts at the feature level: a lightweight head autoregresses on the target model's hidden features to propose tokens |
| Lookahead | Accelerate with n-gram cache |
| Self-Speculative | Use shallow layers as the draft model |
4. KV Cache Optimization
4.1 PagedAttention (vLLM)
Problem: conventional serving systems reserve contiguous KV cache memory per request, causing severe fragmentation and waste under concurrent requests.
PagedAttention (Kwon et al., 2023):
- Borrows the paging mechanism from OS virtual memory
- KV cache split into fixed-size "pages"
- Non-contiguous storage, allocated on demand
- Improves memory utilization from roughly 20-40% to over 95% (toy allocator sketch below)
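A toy sketch of the bookkeeping only (block tables and a shared free pool; no attention kernel and none of vLLM's internals):

```python
class PagedKVCacheAllocator:
    """Toy paged KV cache: each sequence maps to a list of fixed-size blocks from a shared pool."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block IDs
        self.block_tables = {}                       # seq_id -> [block_id, ...]
        self.seq_lens = {}                           # seq_id -> tokens stored so far

    def append_token(self, seq_id: int):
        """Reserve space for one more token's K/V, allocating a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:            # current block is full (or none exists yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or queue the request")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: int):
        """Return a finished sequence's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = PagedKVCacheAllocator(num_blocks=4, block_size=16)
for _ in range(40):       # 40 tokens -> ceil(40/16) = 3 blocks, no large upfront reservation
    alloc.append_token(seq_id=0)
print(alloc.block_tables[0], "free blocks left:", len(alloc.free_blocks))
```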
4.2 KV Cache Compression
| Method | Strategy | Compression |
|---|---|---|
| GQA/MQA | Reduce KV heads | 4-8x |
| Quantized KV | INT8/INT4 quantize KV cache | 2-4x |
| KV Eviction | Drop unimportant KV pairs | Variable |
| StreamingLLM | Keep attention-sink tokens + a sliding window | Cache bounded by window size |
| H2O | Importance-aware KV eviction | Variable |
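Per-sequence KV cache memory is roughly \(2 \times \text{layers} \times \text{KV heads} \times \text{head dim} \times \text{seq len} \times \text{bytes per element}\), so the GQA and quantized-KV rows translate directly into savings. A quick estimate with assumed 7B-class shapes (32 layers, head dim 128, 4K context):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV cache size per sequence in GB (factor 2 = keys and values)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

print(kv_cache_gb(32, kv_heads=32, head_dim=128, seq_len=4096))                    # full MHA, FP16
print(kv_cache_gb(32, kv_heads=8,  head_dim=128, seq_len=4096))                    # GQA: 4x smaller
print(kv_cache_gb(32, kv_heads=8,  head_dim=128, seq_len=4096, bytes_per_elem=1))  # GQA + INT8 KV
```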
4.3 Continuous Batching
- Don't wait for entire batch to complete; release resources as individual requests finish
- New requests join immediately
- Dramatically improves GPU utilization and throughput (see the simplified loop below)
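A highly simplified scheduling loop illustrating the idea; `generate_step` and `is_finished` are hypothetical stand-ins for a batched decode step and a stopping check:

```python
from collections import deque

def continuous_batching_loop(waiting: deque, generate_step, is_finished, max_batch=8):
    """Each iteration decodes one token for every running request, retires finished ones,
    and immediately admits new requests from the queue -- no waiting for a whole batch."""
    running, completed = [], []
    while running or waiting:
        while waiting and len(running) < max_batch:      # admit as soon as a slot frees up
            running.append(waiting.popleft())
        generate_step(running)                           # one decode step for the whole batch
        still_running = []
        for req in running:
            (completed if is_finished(req) else still_running).append(req)
        running = still_running
    return completed

# Toy usage: each "request" just needs a fixed number of decode steps
reqs = deque({"id": i, "remaining": 3 + i % 4} for i in range(20))
def step(batch):
    for r in batch:
        r["remaining"] -= 1
done = continuous_batching_loop(reqs, step, lambda r: r["remaining"] == 0)
print(len(done), "requests completed")
```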
5. Other Optimization Techniques
5.1 Operator Fusion
Merge multiple small operators into one large kernel:
- LayerNorm + Linear fusion
- QKV projection fusion within attention
- FlashAttention (fuses the attention matmuls and softmax into one tiled kernel, never materializing the full attention matrix)
5.2 Model Parallel Inference
- Tensor Parallelism: Shard each layer's weight matrices across multiple GPUs (see the sketch after this list)
- Pipeline Parallelism: Different layers on different GPUs
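The idea behind a column-parallel linear layer, with plain NumPy arrays standing in for per-GPU shards (no actual multi-GPU communication):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 512)).astype(np.float32)
W = rng.standard_normal((512, 2048)).astype(np.float32)

# Column parallelism: each "GPU" holds a slice of W's output columns and computes its part
num_gpus = 4
shards = np.split(W, num_gpus, axis=1)          # 4 shards of shape (512, 512)
partial = [X @ w for w in shards]               # each device computes its own output slice
Y = np.concatenate(partial, axis=1)             # all-gather along the output dimension

print(np.allclose(Y, X @ W, atol=1e-4))
```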
5.3 Compilation Optimization
| Tool | Description |
|---|---|
| TensorRT-LLM | NVIDIA's LLM inference optimization |
| vLLM | PagedAttention + Continuous Batching |
| TGI | HuggingFace's inference serving |
| SGLang | Programmable LLM inference |
| llama.cpp | CPU/hybrid inference |
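As a starting point, offline batched generation with vLLM looks roughly like this (a sketch based on vLLM's documented Python API; the model name and sampling settings are placeholders):

```python
from vllm import LLM, SamplingParams

# Model name and sampling settings are illustrative placeholders
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", dtype="float16")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```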
6. Hardware Comparison
| Hardware | Memory | FP16 TFLOPS | INT8 TOPS | Use Case |
|---|---|---|---|---|
| A100 80GB | 80 GB HBM2e | 312 | 624 | Training + inference |
| H100 80GB | 80 GB HBM3 | 990 | 1979 | High-performance inference |
| L40S | 48 GB GDDR6 | 362 | 724 | Inference-optimized |
| A10G | 24 GB GDDR6 | 125 | 250 | Medium inference |
| Apple M2 Ultra | 192 GB unified | ~27 | - | Local inference |
7. Summary
| Technique | Speedup | Quality Impact | Implementation Complexity |
|---|---|---|---|
| INT8 Quantization | 2x | Minimal | Low |
| INT4 Quantization | 3-4x | Small | Medium |
| Speculative Decoding | 2-3x | Zero | Medium |
| PagedAttention | 2-4x throughput | Zero | Low (use vLLM) |
| Sparse Pruning | 1.5-2x | Small | Medium |
| Operator Fusion | 1.5-2x | Zero | High |
Practical Recommendations:
- Start with mature frameworks like vLLM/TGI
- INT8 quantization is the lowest-cost optimization
- GQA models (LLaMA 2+) are more inference-friendly
- Speculative decoding suits latency-sensitive scenarios
- Combining multiple techniques yields the best results
References
- Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers," ICLR 2023
- Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration," MLSys 2024
- Frantar & Alistarh, "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot," ICML 2023
- Leviathan et al., "Fast Inference from Transformers via Speculative Decoding," ICML 2023
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention," SOSP 2023