vLLM and KV Cache
vLLM is a high-performance LLM inference and serving engine developed at UC Berkeley. Its core innovation is the PagedAttention algorithm, which draws inspiration from virtual memory management in operating systems to efficiently manage the KV Cache, resulting in significantly improved inference throughput.
KV Cache: Why Do We Need Caching?
The Redundant Computation Problem in Autoregressive Generation
LLMs generate text in an autoregressive manner, producing one token at a time. Each time a new token is generated, Attention must be computed over all preceding tokens. Without any optimization, generating the \(t\)-th token requires recomputing the Keys and Values for all \(t-1\) previous tokens, leading to massive redundant computation.
The core idea behind KV Cache is straightforward: cache the previously computed Keys and Values, compute only the K and V for each new token, and append them to the cache.
- Without KV Cache: generating token \(t\) → recompute K, V for all \(t\) tokens → \(O(t^2)\) total computation
- With KV Cache: generating token \(t\) → compute K, V for token \(t\) only → \(O(t)\) incremental computation
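The cached decode step can be sketched as a toy single-head attention loop (illustrative NumPy only; `decode_step` and the weight names are hypothetical, not vLLM's API):

```python
import numpy as np

# Toy single-head attention decode with a KV Cache (illustrative, not vLLM internals).
d = 8  # head dimension (hypothetical)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x):
    """Attention for the newest token only: one K/V projection, O(t) work."""
    q = x @ W_q
    k_cache.append(x @ W_k)        # compute and cache K for this token only
    v_cache.append(x @ W_v)        # compute and cache V for this token only
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)    # attend over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

for t in range(5):                 # 5 decode steps -> 5 cached K/V pairs
    out = decode_step(rng.normal(size=d))
print(len(k_cache))                # 5: one cached K per generated token
```

Each step projects only the newest token; the quadratic recomputation disappears, at the cost of memory that grows linearly with sequence length — which is exactly the problem the next subsection addresses.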
The Memory Problem with KV Cache
While KV Cache reduces computation, it consumes a significant amount of GPU memory. In a Transformer model, every layer and every Attention head requires storing K and V tensors.
Estimated GPU memory for the KV Cache of a single request:

\[
\text{Memory} \approx 2 \times L \times H \times d \times s \times \text{bytes per element}
\]

where \(L\) = number of layers, \(H\) = number of attention heads, \(d\) = dimension per head, and \(s\) = sequence length; the leading factor of 2 accounts for storing both K and V, and fp16 uses 2 bytes per element.
Taking LLaMA-13B as an example (40 layers, 40 heads, 128 dimensions, fp16), the KV Cache costs roughly 800 KB per token, so a single request with 2048 tokens requires approximately 1.6 GB. As the number of concurrent requests grows or sequences become longer, the KV Cache becomes the primary GPU memory bottleneck.
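Plugging the LLaMA-13B numbers into the formula above gives the estimate directly (a quick sanity check in plain Python):

```python
# Back-of-envelope KV Cache size for LLaMA-13B (40 layers, 40 heads, 128-dim heads, fp16).
layers, heads, head_dim = 40, 40, 128
bytes_per_elem = 2                 # fp16
seq_len = 2048

# Factor 2: one K tensor and one V tensor per layer/head/position.
per_token = 2 * layers * heads * head_dim * bytes_per_elem
total = per_token * seq_len

print(per_token)                   # 819200 bytes, i.e. ~800 KB per token
print(total / 2**30)               # ~1.56 GiB for a full 2048-token request
```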
Waste in Traditional Approaches
Before vLLM, most inference engines pre-allocated a contiguous block of memory for each request's KV Cache. Since the generation length cannot be known in advance, memory is typically pre-allocated based on the maximum possible length, leading to:
- Internal Fragmentation: The actual generated sequence is often much shorter than the maximum length, wasting the excess pre-allocated space
- External Fragmentation: After different requests release their memory, the remaining free space is non-contiguous and cannot be utilized by new requests
- Reservation Waste: Memory is commonly over-allocated as a safety margin
Studies have shown that traditional approaches achieve a GPU memory utilization of only about 20-40% for KV Cache.
PagedAttention: The Core Innovation of vLLM
The Virtual Memory Analogy
PagedAttention draws its inspiration directly from virtual memory and paging mechanisms in operating systems:
| OS Concept | PagedAttention Counterpart |
|---|---|
| Virtual Page | Logical KV Block |
| Physical Frame | Fixed-size block in GPU memory |
| Page Table | Block Table (logical-to-physical mapping) |
| Demand Paging | Allocate new blocks only when tokens are generated |
How It Works
- Block-based Storage: The KV Cache of each sequence is divided into fixed-size Blocks (default: 16 tokens per block), rather than pre-allocating one large contiguous chunk of memory
- Dynamic Mapping: A Block Table records the mapping from each sequence's logical blocks to physical blocks; physical blocks can be scattered anywhere in GPU memory
- On-demand Allocation: Physical blocks are allocated only when new tokens require a new block, and can be freed immediately after use
- Sharing Mechanism: When multiple requests share the same prefix (e.g., the same system prompt), physical blocks can be shared via Copy-on-Write
This design reduces KV Cache memory waste to near zero (only the last block may have minor internal fragmentation), achieving a memory utilization close to 100%.
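The bookkeeping behind block-based, on-demand allocation can be sketched as a toy block table (the `BlockTable` class and its methods are hypothetical, not vLLM's internals):

```python
# Toy block-table allocator mimicking PagedAttention's bookkeeping (names hypothetical).
BLOCK_SIZE = 16                        # tokens per block, vLLM's default

class BlockTable:
    def __init__(self, free_blocks):
        self.free = list(free_blocks)        # pool of physical block ids in "GPU memory"
        self.logical_to_physical = []        # index = logical block, value = physical block

    def append_token(self, num_tokens_so_far):
        """Allocate a new physical block only when the current logical block is full."""
        if num_tokens_so_far % BLOCK_SIZE == 0:   # block boundary reached
            self.logical_to_physical.append(self.free.pop())
        return self.logical_to_physical[-1]       # physical block holding this token

table = BlockTable(range(100))         # 100 physical blocks available
for t in range(40):                    # generate 40 tokens
    table.append_token(t)
print(len(table.logical_to_physical))  # 3 blocks: ceil(40 / 16)
```

With 40 tokens, only the last block is partially filled (8 of 16 slots unused), which is the "minor internal fragmentation in the last block" mentioned above; the physical block ids need not be contiguous.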
Continuous Batching
Traditional Static Batching waits for all requests in a batch to finish before processing the next batch. The problem is that different requests vary greatly in generation length — some finish after 10 tokens, while others generate 500. Once short requests complete, they sit idle, leading to low GPU utilization.
vLLM employs Continuous Batching (also known as Iteration-level Batching):
- Scheduling is performed dynamically at each iteration (i.e., each token generation step)
- When a request finishes, a new request immediately fills the vacant slot
- There is no need to wait for the entire batch to complete
Result: In high-concurrency scenarios, throughput improves by 2-4x compared to Static Batching.
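The scheduling difference can be illustrated with a toy simulation (the `continuous_batching` function is hypothetical; one step = one token generated per running request):

```python
from collections import deque

# Toy iteration-level scheduler: finished requests free their slot every step.
def continuous_batching(lengths, max_batch=4):
    waiting = deque(enumerate(lengths))   # (request_id, tokens left to generate)
    running, steps, finished = [], 0, []
    while waiting or running:
        while waiting and len(running) < max_batch:  # fill vacant slots immediately
            running.append(list(waiting.popleft()))
        steps += 1                                   # one decode iteration
        for req in running:
            req[1] -= 1
        finished += [r[0] for r in running if r[1] == 0]
        running = [r for r in running if r[1] > 0]
    return steps, finished

steps, order = continuous_batching([10, 500, 10, 10, 10, 10])
print(steps)   # 500: wall-clock steps are bound by the longest request,
               # not by batch boundaries as in static batching
```

Here static batching with batch size 4 would need 510 steps (500 for the first batch, then 10 for the second), while continuous batching finishes in 500 because the five short requests complete in the shadow of the long one.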
Usage
Python API (Offline Inference)
```python
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=1,        # number of GPUs for tensor parallelism
    gpu_memory_utilization=0.9,
)

# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

# Batched inference
prompts = ["What is deep learning?", "Explain attention mechanism."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
Server Mode (Online Serving)
vLLM provides an OpenAI-compatible API server that can serve as a drop-in replacement for the OpenAI API:
```bash
# Start the server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --tensor-parallel-size 2
```
Client call (fully compatible with the OpenAI API):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```
Comparison with Other Inference Engines
| Feature | vLLM | TGI (HuggingFace) | TensorRT-LLM (NVIDIA) | llama.cpp |
|---|---|---|---|---|
| Core Optimization | PagedAttention | Continuous Batching | Graph optimization + Kernel fusion | Quantization + CPU optimization |
| KV Cache Management | Paged, near-zero waste | Pre-allocated | Pre-allocated + optimized | Simple contiguous allocation |
| Throughput | Very high | High | Highest (on specific hardware) | Lower |
| Ease of Use | Very good (Python-native) | Good (Docker deployment) | More complex (requires compilation) | Very good (single binary) |
| Use Case | High-concurrency cloud serving | Cloud serving | Production with extreme performance needs | Local / edge inference |
| Quantization Support | GPTQ, AWQ, FP8 | GPTQ, BnB | FP8, INT8, INT4 | GGUF (2-8bit) |
| Hardware Requirements | NVIDIA GPU | NVIDIA GPU | NVIDIA GPU | CPU / GPU |
Recommendations:
- High-concurrency online serving → vLLM (ready out of the box with excellent throughput)
- Ultra-low latency optimization → TensorRT-LLM (requires engineering effort for compilation and optimization)
- Rapid prototyping → TGI (well-integrated with the HuggingFace ecosystem)
- Local personal use → llama.cpp / Ollama (see Local Inference Deployment)
References
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention", SOSP 2023
- vLLM GitHub
- vLLM Official Documentation