
vLLM and KV Cache

vLLM is a high-performance LLM inference and serving engine developed at UC Berkeley. Its core innovation is the PagedAttention algorithm, which draws inspiration from virtual memory management in operating systems to efficiently manage the KV Cache, resulting in significantly improved inference throughput.


KV Cache: Why Do We Need Caching?

The Redundant Computation Problem in Autoregressive Generation

LLMs generate text in an autoregressive manner, producing one token at a time. Each time a new token is generated, Attention must be computed over all preceding tokens. Without any optimization, generating the \(t\)-th token requires recomputing the Keys and Values for all \(t-1\) previous tokens, leading to massive redundant computation.

The core idea behind KV Cache is straightforward: cache the previously computed Keys and Values, compute only the K and V for each new token, and append them to the cache.

Without KV Cache: Generate token t → Recompute K, V for all t tokens → O(t^2) total computation
With KV Cache:    Generate token t → Compute K, V for token t only  → O(t) incremental computation
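
The pattern looks roughly like the following sketch (a minimal illustration using PyTorch; the tensor shapes and the cache dict are assumptions for clarity, not vLLM internals):

import torch

def decode_step(q, k_new, v_new, kv_cache):
    """One decoding step with a KV Cache.

    q:        (batch, heads, 1, d)  query for the newly generated token
    k_new:    (batch, heads, 1, d)  key for the new token only
    v_new:    (batch, heads, 1, d)  value for the new token only
    kv_cache: dict holding K/V of all previous tokens, empty on the first step
    """
    if "k" in kv_cache:
        # Append the new token's K/V instead of recomputing the whole prefix.
        k = torch.cat([kv_cache["k"], k_new], dim=2)
        v = torch.cat([kv_cache["v"], v_new], dim=2)
    else:
        k, v = k_new, v_new
    kv_cache["k"], kv_cache["v"] = k, v

    # Attention over the cached keys/values: O(t) work for the new token.
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)
    attn = torch.softmax(scores, dim=-1)
    return attn @ v, kv_cache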

The Memory Problem with KV Cache

While KV Cache reduces computation, it consumes a significant amount of GPU memory. In a Transformer model, every layer and every Attention head requires storing K and V tensors.

Estimated GPU memory for the KV Cache of a single request:

\[ \text{KV Cache Size} = 2 \times L \times H \times d \times s \times \text{sizeof(dtype)} \]

where \(L\) = number of layers, \(H\) = number of attention heads, \(d\) = dimension per head, and \(s\) = sequence length.

Taking LLaMA-13B as an example (40 layers, 40 heads, 128 dimensions per head, fp16), the KV Cache costs roughly 800 KB per token, so a single request with 2048 tokens requires approximately 1.6 GB. As the number of concurrent requests grows or sequences become longer, the KV Cache becomes the primary GPU memory bottleneck.
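
As a sanity check on the formula, a few lines of Python reproduce this estimate:

def kv_cache_bytes(layers, heads, head_dim, seq_len, dtype_bytes=2):
    # Factor of 2 for storing both K and V at every layer and head.
    return 2 * layers * heads * head_dim * seq_len * dtype_bytes

# LLaMA-13B: 40 layers, 40 heads, head dim 128, fp16 (2 bytes), 2048 tokens
size = kv_cache_bytes(layers=40, heads=40, head_dim=128, seq_len=2048)
print(f"{size / 2**30:.2f} GiB")   # ~1.56 GiB, i.e. ~800 KB per token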

Waste in Traditional Approaches

Before vLLM, most inference engines pre-allocated a contiguous block of memory for each request's KV Cache. Since the generation length cannot be known in advance, memory is typically pre-allocated based on the maximum possible length, leading to:

  • Internal Fragmentation: The actual generated sequence is often much shorter than the maximum length, wasting the excess pre-allocated space
  • External Fragmentation: After different requests release their memory, the remaining free space is non-contiguous and cannot be utilized by new requests
  • Reservation Waste: Memory is commonly over-allocated as a safety margin

Measurements show that under such pre-allocation schemes, only about 20-40% of the memory reserved for the KV Cache ends up holding useful data.


PagedAttention: The Core Innovation of vLLM

The Virtual Memory Analogy

PagedAttention draws its inspiration directly from virtual memory and paging mechanisms in operating systems:

| OS Concept     | PagedAttention Counterpart                        |
|----------------|---------------------------------------------------|
| Virtual Page   | Logical KV Block                                  |
| Physical Frame | Fixed-size physical block in GPU memory           |
| Page Table     | Block Table (logical-to-physical mapping)         |
| Demand Paging  | Allocate new blocks only as tokens are generated  |

How It Works

  1. Block-based Storage: The KV Cache of each sequence is divided into fixed-size Blocks (default: 16 tokens per block), rather than pre-allocating one large contiguous chunk of memory
  2. Dynamic Mapping: A Block Table records the mapping from each sequence's logical blocks to physical blocks; physical blocks can be scattered anywhere in GPU memory
  3. On-demand Allocation: Physical blocks are allocated only when new tokens require a new block, and can be freed immediately after use
  4. Sharing Mechanism: When multiple requests share the same prefix (e.g., the same system prompt), physical blocks can be shared via Copy-on-Write

This design reduces KV Cache memory waste to near zero (only the last block may have minor internal fragmentation), achieving a memory utilization close to 100%.
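
The bookkeeping described above can be sketched in a few lines (the block size matches vLLM's default, but the class names and free-list allocator are simplified illustrations, not vLLM's actual data structures):

BLOCK_SIZE = 16  # tokens per KV block (vLLM's default)

class BlockAllocator:
    """Hands out fixed-size physical blocks from a free list."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class Sequence:
    """Maps a sequence's logical blocks to scattered physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # On-demand allocation: grab a new physical block only when the
        # current block is full (or on the very first token).
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self):
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()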


Continuous Batching

Traditional Static Batching waits for all requests in a batch to finish before processing the next batch. The problem is that different requests vary greatly in generation length — some finish after 10 tokens, while others generate 500. Once the short requests finish, their slots sit idle until the longest request completes, leading to low GPU utilization.

vLLM employs Continuous Batching (also known as Iteration-level Batching):

  • Scheduling is performed dynamically at each iteration (i.e., each token generation step)
  • When a request finishes, a new request immediately fills the vacant slot
  • There is no need to wait for the entire batch to complete

Result: In high-concurrency scenarios, throughput improves by 2-4x compared to Static Batching.
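
Schematically, iteration-level scheduling looks like this (the queue handling and the step function are simplified assumptions, not vLLM's real scheduler):

from collections import deque

def continuous_batching(waiting: deque, max_batch_size: int, step):
    """Iteration-level scheduling: refill the batch at every step.

    `step(batch)` generates one token for every request in `batch` and
    returns the subset of requests that finished this iteration.
    """
    running = []
    while running or waiting:
        # Fill vacant slots immediately instead of waiting for the whole batch.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        finished = step(running)          # one token per running request
        running = [r for r in running if r not in finished]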


Usage

Python API (Offline Inference)

from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
          tensor_parallel_size=1,   # number of GPUs for tensor parallelism
          gpu_memory_utilization=0.9)

# Set the sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256
)

# Batched inference
prompts = ["What is deep learning?", "Explain attention mechanism."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Server Mode (Online Serving)

vLLM provides an OpenAI-compatible API server that can serve as a drop-in replacement for the OpenAI API:

# Start the server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --tensor-parallel-size 2

Client call (fully compatible with the OpenAI API):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128
)
print(response.choices[0].message.content)

Comparison with Other Inference Engines

| Feature               | vLLM                           | TGI (HuggingFace)          | TensorRT-LLM (NVIDIA)                    | llama.cpp                       |
|-----------------------|--------------------------------|----------------------------|------------------------------------------|---------------------------------|
| Core Optimization     | PagedAttention                 | Continuous Batching        | Graph optimization + kernel fusion       | Quantization + CPU optimization |
| KV Cache Management   | Paged, near-zero waste         | Pre-allocated              | Pre-allocated + optimized                | Simple contiguous allocation    |
| Throughput            | Very high                      | High                       | Highest (on specific hardware)           | Lower                           |
| Ease of Use           | Very good (Python-native)      | Good (Docker deployment)   | More complex (requires compilation)      | Very good (single binary)       |
| Use Case              | High-concurrency cloud serving | Cloud serving              | Production with extreme performance needs | Local / edge inference          |
| Quantization Support  | GPTQ, AWQ, FP8                 | GPTQ, BnB                  | FP8, INT8, INT4                          | GGUF (2-8 bit)                  |
| Hardware Requirements | NVIDIA GPU                     | NVIDIA GPU                 | NVIDIA GPU                               | CPU / GPU                       |

Recommendations:

  • High-concurrency online serving → vLLM (ready out of the box with excellent throughput)
  • Ultra-low latency optimization → TensorRT-LLM (requires engineering effort for compilation and optimization)
  • Rapid prototyping → TGI (well-integrated with the HuggingFace ecosystem)
  • Local personal use → llama.cpp / Ollama (see Local Inference Deployment)
