
KV Cache and Long-Context Inference

KV Cache is a core technique for accelerating LLM inference. As model context windows expand from 2K to 128K or even millions of tokens, efficiently managing the KV Cache and supporting long-context inference has become a critical challenge in inference engineering.

For the fundamentals of KV Cache and PagedAttention, see vLLM.


KV Cache Recap

Why KV Cache Is Needed

In the autoregressive generation process of a Transformer, generating each new token requires computing Attention over all preceding tokens. KV Cache stores the previously computed Keys and Values to avoid redundant computation:

\[ \text{Attention}(q_t, K_{1:t}, V_{1:t}) = \text{softmax}\left(\frac{q_t K_{1:t}^T}{\sqrt{d_k}}\right) V_{1:t} \]

At each step, only the new token's \(q_t, k_t, v_t\) need to be computed, after which \(k_t, v_t\) are appended to the cache.
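
To make the bookkeeping concrete, here is a minimal single-head sketch of one decode step with a KV Cache, using plain numpy and illustrative shapes (not any particular framework's API):

```python
import numpy as np

def decode_step(q_t, k_t, v_t, k_cache, v_cache):
    """One autoregressive step: append the new token's K/V, then attend.

    q_t, k_t, v_t: (d_k,) vectors for the new token
    k_cache, v_cache: (t-1, d_k) arrays holding all previous Keys/Values
    """
    k_cache = np.vstack([k_cache, k_t])                 # K_{1:t}
    v_cache = np.vstack([v_cache, v_t])                 # V_{1:t}
    scores = k_cache @ q_t / np.sqrt(q_t.shape[-1])     # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over all t positions
    out = weights @ v_cache                             # (d_k,)
    return out, k_cache, v_cache

# toy usage: an 8-dim head with 3 cached tokens and one new token
d_k = 8
k_cache, v_cache = np.random.randn(3, d_k), np.random.randn(3, d_k)
q_t, k_t, v_t = (np.random.randn(d_k) for _ in range(3))
out, k_cache, v_cache = decode_step(q_t, k_t, v_t, k_cache, v_cache)
```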

Memory Footprint Estimation

For a model with \(L\) layers, \(H\) attention heads, per-head dimension \(d\), and sequence length \(s\), the KV Cache memory footprint is:

\[ \text{Memory} = 2 \times L \times H \times d \times s \times \text{sizeof(dtype)} \]

| Model        | Parameters | Sequence Length | KV Cache (FP16) |
|--------------|------------|-----------------|-----------------|
| LLaMA-2-7B   | 7B         | 4K              | ~1 GB           |
| LLaMA-2-13B  | 13B        | 4K              | ~1.6 GB         |
| LLaMA-3-70B  | 70B        | 8K              | ~10 GB          |
| LLaMA-3-70B  | 70B        | 128K            | ~160 GB         |

As shown, KV Cache memory grows linearly with context length. In long-context scenarios, it can far exceed the memory occupied by the model weights themselves.
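
The formula is easy to evaluate directly. The sketch below plugs in an illustrative GQA configuration (32 layers, 8 KV heads, head dimension 128, FP16; assumed values, not a specific model's config) to show the linear growth with sequence length:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # the leading 2 accounts for storing both Keys and Values
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes / 2**30

# Illustrative 32-layer, 8-KV-head, 128-dim config in FP16:
for seq_len in (4_096, 32_768, 131_072):
    print(f"{seq_len:>7} tokens -> {kv_cache_gib(32, 8, 128, seq_len):6.2f} GiB")
```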


Core Challenges of Long Context

Memory Bottleneck

The KV Cache for standard Attention scales linearly with sequence length, but total memory consumption explodes under concurrent requests. For example, with a 70B model at 128K context length, the KV Cache for a single request already requires ~160 GB, making production deployment extremely difficult.

Attention Computation Complexity

Standard Self-Attention has \(O(n^2)\) computational complexity. When the sequence length reaches 128K, the Attention computation itself becomes a bottleneck.

Positional Encoding Extrapolation

Traditional absolute positional encodings fail when extrapolated beyond the training length, necessitating positional encoding schemes that can generalize to longer sequences.


Positional Encoding and Long-Context Support

RoPE (Rotary Position Embedding)

RoPE is the dominant positional encoding scheme used by today's leading LLMs (LLaMA, Qwen, Mistral, etc.). It encodes positional information as rotation matrices, so that attention scores naturally depend on the relative distance between tokens:

\[ f(q, m) = q \cdot e^{im\theta}, \quad f(k, n) = k \cdot e^{in\theta} \]
\[ \langle f(q,m), f(k,n) \rangle = \text{Re}[q \cdot k^* \cdot e^{i(m-n)\theta}] \]

where \(\theta_j = 10000^{-2j/d}\) sets the rotation frequency of each dimension pair, with 10000 serving as the frequency base.
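
The rotation can be written out in a few lines. The numpy sketch below is an illustrative implementation (not any specific library's); it applies the per-pair rotations and checks that the resulting \(q \cdot k\) score depends only on the relative offset \(m - n\):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each (even, odd) dimension pair of x by pos * theta_j (a sketch)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)        # theta_j = base^{-2j/d}
    angles = pos * theta
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rot_even = x_even * np.cos(angles) - x_odd * np.sin(angles)
    rot_odd  = x_even * np.sin(angles) + x_odd * np.cos(angles)
    out = np.empty_like(x)
    out[..., 0::2], out[..., 1::2] = rot_even, rot_odd
    return out

# The q.k score depends only on the relative offset m - n:
d = 64
q, k = np.random.randn(d), np.random.randn(d)
s1 = rope(q, 100) @ rope(k, 97)      # offset 3 at positions 100 / 97
s2 = rope(q, 1000) @ rope(k, 997)    # offset 3 at positions 1000 / 997
assert np.allclose(s1, s2)
```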

Methods for extending to longer contexts:

  • NTK-aware Scaling: Adjusts the frequency base \(b = 10000\) to \(b' = b \cdot \alpha^{d/(d-2)}\), compressing low-frequency components to cover longer ranges while leaving high-frequency components nearly unchanged (see the sketch after this list)
  • YaRN (Yet another RoPE extensioN): Combines NTK Scaling with attention distribution correction, extending the context window while preserving the shape of the attention distribution
  • Dynamic NTK: Dynamically adjusts the scaling factor based on the actual input length, leaving short sequences unaffected
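
As a sketch of the first item, the snippet below applies the NTK-aware base adjustment for an assumed 4x extension factor and checks that the highest frequency is left untouched while the lowest is compressed roughly 4x (illustrative values, not any library's API):

```python
import numpy as np

def ntk_scaled_theta(head_dim, scale, base=10000.0):
    """NTK-aware scaling (a sketch): enlarge the base so low frequencies
    cover roughly `scale`x longer ranges while high frequencies barely change."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    j = np.arange(0, head_dim, 2)
    return new_base ** (-j / head_dim)          # theta_j under the new base

theta_orig = 10000.0 ** (-np.arange(0, 128, 2) / 128)
theta_ntk  = ntk_scaled_theta(128, scale=4)
# Highest frequency (j = 0) is unchanged; lowest is compressed ~4x:
print(theta_ntk[0] / theta_orig[0], theta_ntk[-1] / theta_orig[-1])
```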

ALiBi (Attention with Linear Biases)

ALiBi foregoes positional encodings entirely, instead adding a linear bias penalty to the Attention scores:

\[ \text{score}(q_i, k_j) = q_i \cdot k_j^T - m \cdot |i - j| \]

where \(m\) is a slope specific to each Attention Head. The farther apart two tokens are, the larger the penalty. This design inherently supports length extrapolation, since it requires no learned positional encodings and the pattern remains consistent at any sequence length.
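
A small numpy sketch of the bias construction (the slope schedule follows the ALiBi paper's power-of-two recipe; shapes are illustrative):

```python
import numpy as np

def alibi_bias(n_heads, seq_len):
    """Per-head linear distance penalties (a sketch of the ALiBi idea)."""
    # Slopes form a geometric sequence per head, as in the ALiBi paper
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    distance = np.abs(i - j)                     # |i - j|
    return -slopes[:, None, None] * distance     # (heads, seq, seq)

# Added to q_i . k_j^T before the softmax; larger distance => larger penalty
bias = alibi_bias(n_heads=8, seq_len=5)
print(bias[0])   # the head with the largest slope (strongest distance penalty)
```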


KV Cache Compression Techniques

Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)

In standard Multi-Head Attention, each head has its own independent K and V projections. MQA and GQA reduce the KV Cache by sharing K and V across heads:

| Scheme         | # K/V Heads           | KV Cache Compression | Representative Models |
|----------------|-----------------------|----------------------|-----------------------|
| MHA (standard) | \(H\)                 | 1x                   | GPT-3                 |
| GQA            | \(G\) (\(1 < G < H\)) | \(H/G\)x             | LLaMA-2-70B, LLaMA-3  |
| MQA            | 1                     | \(H\)x               | PaLM, Falcon          |

GQA is the predominant choice today, compressing the KV Cache by several times with virtually no quality degradation. For instance, LLaMA-3-8B uses 8-group GQA (\(H=32, G=8\)), reducing the KV Cache to 1/4 of its original size.
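
The memory saving comes purely from caching fewer K/V heads; at attention time each query head simply indexes its group's shared K/V. A minimal numpy sketch with illustrative shapes (not a framework API):

```python
import numpy as np

# A sketch of GQA KV sharing: H=32 query heads attend over G=8 cached KV heads,
# so each group of H/G = 4 query heads reads the same cached K/V.
H, G, seq_len, d = 32, 8, 1024, 128
k_cache = np.random.randn(G, seq_len, d)        # only G heads are cached
v_cache = np.random.randn(G, seq_len, d)
q = np.random.randn(H, d)                       # one new token, H query heads

group = np.arange(H) // (H // G)                # query head -> KV group index
scores = np.einsum('hd,hsd->hs', q, k_cache[group]) / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = np.einsum('hs,hsd->hd', weights, v_cache[group])
print(out.shape)   # (32, 128), while the cache stores only 8 heads -> 4x smaller
```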

KV Cache Quantization

Quantizing the KV Cache from FP16 to INT8 or INT4 further reduces memory usage. Research has shown that Keys and Values follow noticeably different distributions (Keys tend to concentrate large-magnitude outliers in a few channels), so practical schemes often quantize the two with different granularities:

  • KV Cache INT8 Quantization: Halves memory usage with negligible accuracy loss
  • KV Cache INT4 Quantization: Reduces memory to 1/4, though some perceptible accuracy degradation may occur on certain tasks in long-context scenarios

Mainstream frameworks such as vLLM and TensorRT-LLM already support FP8 KV Cache.
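
As a sketch of the INT8 case, the snippet below performs symmetric per-token quantization of an FP16 cache tensor (a toy illustration; production kernels choose the granularity, per-channel vs. per-token, more carefully):

```python
import numpy as np

def quantize_int8(kv, axis=-1):
    """Symmetric per-token INT8 quantization of a KV tensor (a sketch)."""
    scale = np.abs(kv).max(axis=axis, keepdims=True) / 127.0
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

k = np.random.randn(1024, 128).astype(np.float16)   # (tokens, head_dim), FP16
q8, scale = quantize_int8(k)
print(q8.nbytes / k.nbytes)                          # 0.5: memory halved vs FP16
print(np.abs(dequantize_int8(q8, scale) - k).mean()) # small reconstruction error
```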

Token Eviction and Sparse Attention

Not all tokens' KV entries are equally important. Several methods identify "unimportant" tokens and evict their KV entries to compress the cache:

  • H2O (Heavy-Hitter Oracle): Retains tokens with the highest cumulative Attention scores ("heavy hitters") along with the most recent tokens, evicting the rest
  • StreamingLLM: Retains only the first few "attention sink" tokens plus tokens within a sliding window, enabling theoretically unbounded streaming inference
  • Sliding Window Attention: Each token attends only to the most recent \(W\) tokens (Mistral uses \(W=4096\)), fixing the KV Cache size at \(W\)
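
These eviction policies reduce to fairly simple bookkeeping. Below is a toy sketch of the StreamingLLM-style policy (attention sinks plus a sliding window); a real implementation also has to handle positional indices, which is omitted here:

```python
from collections import deque

class StreamingKVCache:
    """A sketch of StreamingLLM-style eviction: keep the first `n_sink`
    "attention sink" tokens plus a sliding window of the most recent tokens."""
    def __init__(self, n_sink=4, window=1024):
        self.n_sink = n_sink
        self.sink, self.recent = [], deque(maxlen=window)

    def append(self, kv_entry):
        if len(self.sink) < self.n_sink:
            self.sink.append(kv_entry)
        else:
            self.recent.append(kv_entry)   # the deque evicts the oldest entry

    def entries(self):
        return self.sink + list(self.recent)

cache = StreamingKVCache(n_sink=4, window=8)
for t in range(100):
    cache.append((f"k{t}", f"v{t}"))
print(len(cache.entries()))   # stays at 4 + 8 = 12 no matter how long the stream
```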

Engineering Practices for Long-Context Inference

Prefill-Decode Disaggregation

A key optimization for long-context inference is separating the Prefill phase (processing the prompt) from the Decode phase (token-by-token generation):

  • Prefill phase: Compute-intensive and highly parallelizable. Processes the entire prompt and computes the KV Cache for all tokens in one pass
  • Decode phase: Memory-bandwidth-intensive, generating only one token per step

Disaggregated Serving allows the two phases to be optimized independently using different hardware or different batching strategies.
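
A back-of-the-envelope way to see why the two phases behave so differently is arithmetic intensity: for a single weight matrix with \(N\) parameters, a forward pass over \(B\) tokens does roughly \(2NB\) FLOPs while reading \(N \cdot \text{sizeof(dtype)}\) bytes of weights. The sketch below uses illustrative numbers (not a benchmark):

```python
def arithmetic_intensity(tokens_per_pass, dtype_bytes=2):
    """Approximate FLOPs per byte of weights read for one weight matrix.

    A matmul over `tokens_per_pass` tokens does ~2 * N * tokens FLOPs while
    reading the N-parameter weight matrix (N * dtype_bytes bytes) once.
    """
    return 2 * tokens_per_pass / dtype_bytes

print(arithmetic_intensity(4096))  # prefill over a 4K prompt: ~4096 FLOPs/byte, compute-bound
print(arithmetic_intensity(1))     # decode with batch of 1: ~1 FLOP/byte, bandwidth-bound
```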

Prefix Caching

When multiple requests share the same prefix (e.g., a system prompt), the prefix's KV Cache can be cached and reused:

Request A: [System Prompt] + [User question A]
Request B: [System Prompt] + [User question B]
                  ↑
         Shared KV Cache, computed only once

Both vLLM's Automatic Prefix Caching and SGLang's RadixAttention implement this mechanism.
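
A toy sketch of the idea, keying cached KV blocks by a hash of the token-id prefix (this is not vLLM's or SGLang's actual implementation, which operates at block / radix-tree granularity):

```python
import hashlib

class PrefixCache:
    """Toy prefix reuse: cached KV entries are keyed by a hash of the
    token-id prefix they cover, so identical prefixes hit the same entry."""
    def __init__(self):
        self.blocks = {}

    def _key(self, token_ids):
        return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

    def get_or_compute(self, prefix_ids, compute_kv):
        k = self._key(prefix_ids)
        if k not in self.blocks:
            self.blocks[k] = compute_kv(prefix_ids)   # run prefill only on a miss
        return self.blocks[k]

cache = PrefixCache()
system_prompt = [1, 15, 42, 7]   # hypothetical token ids
kv_a = cache.get_or_compute(system_prompt, lambda ids: f"kv({len(ids)} tokens)")
kv_b = cache.get_or_compute(system_prompt, lambda ids: f"kv({len(ids)} tokens)")
assert kv_a is kv_b              # request B reuses request A's prefill work
```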

Chunked Prefill

For extremely long prompts, performing Prefill in a single pass may cause memory overflow or monopolize the GPU for an extended period. Chunked Prefill splits the prompt into smaller chunks and processes them incrementally:

  • Avoids peak memory spikes from a single Prefill pass
  • Allows Decode steps from other requests to be interleaved between chunks, reducing latency
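
A minimal sketch of the control flow, with hypothetical `prefill_chunk` and `decode_others` callbacks standing in for the engine's real scheduling:

```python
def chunked_prefill(prompt_ids, chunk_size, prefill_chunk, decode_others=None):
    """Process a long prompt in fixed-size chunks (a sketch).

    `prefill_chunk(chunk, kv_cache)` appends the chunk's K/V to the cache;
    `decode_others()` lets other requests' decode steps run between chunks.
    """
    kv_cache = []
    for start in range(0, len(prompt_ids), chunk_size):
        prefill_chunk(prompt_ids[start:start + chunk_size], kv_cache)
        if decode_others is not None:
            decode_others()          # interleave decode steps to keep latency low
    return kv_cache

# toy usage: a 10K-token prompt processed in 2K-token chunks
kv = chunked_prefill(
    list(range(10_000)), 2_048,
    prefill_chunk=lambda chunk, cache: cache.extend(chunk),
    decode_others=lambda: None,
)
print(len(kv))   # 10000: the full prompt's cache, built incrementally
```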

Flash Attention

Flash Attention accelerates Attention computation through tiled computation and reduced HBM accesses, making it an essential optimization for long-context inference:

  • Flash Attention 2: Improves parallelism strategies and warp scheduling
  • Flash Attention 3: Optimized for the Hopper architecture (H100), leveraging asynchronous execution and FP8 support

Flash Attention does not reduce the size of the KV Cache, but it dramatically speeds up the Attention computation itself, making longer contexts computationally feasible.
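
For example, PyTorch's fused `scaled_dot_product_attention` can dispatch to a FlashAttention kernel on supported GPUs (a sketch; on other devices it silently falls back to a standard implementation):

```python
import torch
import torch.nn.functional as F

# Shapes: (batch, heads, seq, head_dim). With a fused kernel, the full
# seq x seq score matrix is never materialized in HBM.
q = torch.randn(1, 8, 1024, 128)
k = torch.randn(1, 8, 1024, 128)
v = torch.randn(1, 8, 1024, 128)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # (1, 8, 1024, 128)
```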


Long-Context Support Across Major Frameworks

| Framework     | Prefix Caching           | KV Cache Quantization | Chunked Prefill | Flash Attention |
|---------------|--------------------------|-----------------------|-----------------|-----------------|
| vLLM          | Automatic prefix caching | FP8                   | Supported       | FA2             |
| TensorRT-LLM  | Supported                | FP8, INT8             | Supported       | FA2/FA3         |
| SGLang        | RadixAttention           | FP8                   | Supported       | FA2             |
| DeepSpeed-MII | Supported                | Supported             | Supported       | FA2             |

References

  • Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention", SOSP 2023
  • Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints", EMNLP 2023
  • Xiao et al., "Efficient Streaming Language Models with Attention Sinks", ICLR 2024
  • Dao et al., "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning", ICLR 2024
  • Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding", Neurocomputing 2024
