Working Memory and Context Management
Introduction
The LLM's context window is the agent's working memory. Just as human working memory holds only a limited number of items (Miller's "7 ± 2" finding), the context window is capped at a fixed number of tokens. How efficiently this finite resource is managed directly determines the agent's capability ceiling.
Context Window as Working Memory
Miller's Law and Token Limits
George Miller (1956) found that human short-term memory holds roughly 7 ± 2 chunks. Analogizing to LLMs:
| Model | Context Window | Approximate Words |
|---|---|---|
| GPT-3.5 | 4K tokens | ~3,000 words |
| GPT-4 | 8K/32K tokens | ~6,000/24,000 words |
| GPT-4 Turbo | 128K tokens | ~96,000 words |
| Claude 3.5 | 200K tokens | ~150,000 words |
| Gemini 1.5 Pro | 1M tokens | ~750,000 words |
Context Composition
A typical agent's context contains:
[System Prompt] ~500-2000 tokens
[Tool Definitions] ~200-1000 tokens/tool
[Conversation History] Variable
[Retrieved Context] Variable
[Current User Input] Variable
─────────────────────────
Total must be < Context window limit
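Before sending a request, it is worth verifying that these components actually fit. The sketch below uses tiktoken's cl100k_base encoding as a stand-in tokenizer (an assumption; use whatever tokenizer matches your model) and placeholder contents for each component:

# Minimal sketch: check that all context components fit the window.
# Component contents are placeholders; plug in your real prompt pieces.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

components = {
    "system_prompt": "You are a helpful research agent...",
    "tool_definitions": '{"name": "search", "description": "Web search"}',
    "history": "user: ...\nassistant: ...",
    "retrieved": "[doc 1] ...\n[doc 2] ...",
    "user_input": "Summarize the latest results.",
}

CONTEXT_LIMIT = 128_000       # model's context window
RESERVED_FOR_OUTPUT = 4_000   # leave room for the generated reply

total = sum(count_tokens(text) for text in components.values())
print({k: count_tokens(v) for k, v in components.items()}, "total:", total)
assert total + RESERVED_FOR_OUTPUT <= CONTEXT_LIMIT, f"Context too large: {total} tokens"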
Attention Mechanism Fundamentals
The Transformer's self-attention mechanism is the core of context processing:

\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
\]

where:
- \(Q\) (Query), \(K\) (Key), \(V\) (Value) are linear transformations of the input
- \(d_k\) is the Key dimension; dividing by \(\sqrt{d_k}\) keeps the dot products from growing so large that the softmax saturates and gradients vanish
- softmax produces the attention weight distribution
Key Issue: Self-attention has \(O(n^2)\) computational complexity, where \(n\) is the sequence length, so the compute and memory needed for the attention scores grow quadratically as the context gets longer.
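A toy NumPy implementation makes both the formula and the quadratic cost concrete: the score matrix \(QK^\top\) has shape \((n, n)\), so doubling the sequence length quadruples the work.

# Toy scaled dot-product attention in NumPy (illustrative: single head, no masking).
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) matrix: the O(n^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted sum of values

n, d = 8, 16                                         # sequence length, head dimension
X = np.random.randn(n, d)
Q, K, V = X @ np.random.randn(d, d), X @ np.random.randn(d, d), X @ np.random.randn(d, d)
print(attention(Q, K, V).shape)                      # (8, 16)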
The "Lost in the Middle" Problem
Liu et al. (2023) discovered that LLM utilization of information in the middle of the context is significantly lower than at the beginning and end:
Attention Utilization
^
|███ ███
|████ ████
|█████ █████
|██████ ██████
|████████ ████████
|██████████████████████████████
+─────────────────────────────→ Position
Beginning Middle End
Implication: Place the most important information at the beginning and end of the context.
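A simple way to act on this is to reorder retrieved passages so the strongest ones land at the edges of the prompt and the weakest sit in the middle. The sketch below is a hypothetical helper, assuming the input list is already sorted from most to least relevant:

def order_for_edges(passages):
    """Place the most relevant passages at the beginning and end of the context.

    `passages` is assumed sorted from most to least relevant; items are
    alternated between the front and the back so the weakest material ends
    up in the middle, where attention utilization is lowest.
    """
    front, back = [], []
    for i, p in enumerate(passages):
        (front if i % 2 == 0 else back).append(p)
    return front + back[::-1]

print(order_for_edges(["p1", "p2", "p3", "p4", "p5"]))
# ['p1', 'p3', 'p5', 'p4', 'p2']  -> best passages at the edges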
Context Compression Strategies
1. Conversation History Summarization
Compress overly long conversation histories into summaries:
def compress_history(messages, max_tokens=2000):
"""Compress conversation history into summary"""
if count_tokens(messages) <= max_tokens:
return messages
# Keep system prompt and recent messages
system = messages[0]
recent = messages[-4:] # Keep last 2 conversation turns
# Summarize middle portion
middle = messages[1:-4]
summary = llm.summarize(middle)
return [system, {"role": "system", "content": f"Conversation summary: {summary}"}] + recent
2. Sliding Window
Keep the most recent N conversation turns, discard older content:
def sliding_window(messages, window_size=10):
"""Sliding window strategy"""
system = [m for m in messages if m["role"] == "system"]
conversation = [m for m in messages if m["role"] != "system"]
# Keep most recent window_size turns
recent = conversation[-window_size * 2:]
return system + recent
3. Recursive Summarization
def recursive_summarize(messages, chunk_size=10):
"""Recursive summarization: chunk history, summarize, then re-summarize"""
if len(messages) <= chunk_size:
return summarize(messages)
chunks = [messages[i:i+chunk_size] for i in range(0, len(messages), chunk_size)]
summaries = [summarize(chunk) for chunk in chunks]
return recursive_summarize(summaries, chunk_size)
4. Retrieval-Augmented Context Management
Instead of storing complete history, retrieve relevant segments on demand:
def retrieval_augmented_context(query, history_store, top_k=5):
"""Retrieve relevant history based on current query"""
# Vectorize the query
query_embedding = embed(query)
# Retrieve most relevant segments from history store
relevant = history_store.similarity_search(query_embedding, top_k=top_k)
# Assemble context
context = "\n".join([f"[{r.timestamp}] {r.content}" for r in relevant])
return context
Advanced Context Management Strategies
Token Budget Allocation
pie title Context Token Budget Allocation (128K Window Example)
"System Prompt" : 5
"Tool Definitions" : 10
"Retrieved Context" : 30
"Conversation History" : 35
"Current Input" : 10
"Reserved Generation Space" : 10
Dynamic Context Management
class ContextManager:
def __init__(self, max_tokens=128000):
self.max_tokens = max_tokens
self.budget = {
"system": int(max_tokens * 0.05),
"tools": int(max_tokens * 0.10),
"retrieval": int(max_tokens * 0.30),
"history": int(max_tokens * 0.35),
"input": int(max_tokens * 0.10),
"generation": int(max_tokens * 0.10),
}
def build_context(self, system_prompt, tools, query, history, retrieval_results):
context = []
# 1. System prompt (must be complete)
context.append(truncate(system_prompt, self.budget["system"]))
# 2. Tool definitions (ranked by relevance)
relevant_tools = rank_tools_by_relevance(tools, query)
context.extend(fit_within_budget(relevant_tools, self.budget["tools"]))
# 3. Retrieved results (ranked by relevance)
context.extend(fit_within_budget(retrieval_results, self.budget["retrieval"]))
        # 4. Conversation history (compressed; most recent turns kept verbatim)
compressed_history = self.compress_history(history, self.budget["history"])
context.extend(compressed_history)
# 5. Current input
context.append(query)
return context
def compress_history(self, history, budget):
"""Intelligently compress conversation history"""
if count_tokens(history) <= budget:
return history
# Strategy: keep most recent N turns + summary
recent = history[-6:] # Most recent 3 turns
remaining_budget = budget - count_tokens(recent)
if remaining_budget > 0:
older = history[:-6]
summary = summarize_to_budget(older, remaining_budget)
return [summary] + recent
else:
return recent[-4:] # Degrade to keeping most recent 2 turns
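A hypothetical usage sketch follows. SYSTEM_PROMPT, ALL_TOOL_DEFINITIONS, conversation_history, and vector_store are placeholders, and the helpers used inside ContextManager (truncate, rank_tools_by_relevance, fit_within_budget, summarize_to_budget, count_tokens) are assumed to be provided elsewhere:

# Illustrative only: all inputs below are placeholders.
manager = ContextManager(max_tokens=128_000)
context_parts = manager.build_context(
    system_prompt=SYSTEM_PROMPT,
    tools=ALL_TOOL_DEFINITIONS,
    query="Compare Q3 and Q4 revenue from the attached reports.",
    history=conversation_history,
    retrieval_results=vector_store.search("Q3 Q4 revenue", top_k=8),
)
prompt = "\n\n".join(str(part) for part in context_parts)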
Strategies for Ultra-Long Contexts (128K+)
For models supporting ultra-long contexts:
- Hierarchical organization: Organize context sections with clear separators and headers
- Importance markers: Add explicit markers to key information (e.g., [IMPORTANT])
- Structured format: Use XML/JSON to structure context for easier model parsing
- Redundancy elimination: Remove duplicate information to avoid wasting tokens
def structure_context(sections):
"""Structure context using XML tags"""
context = ""
for section in sections:
context += f"<{section.tag} priority='{section.priority}'>\n"
context += section.content
context += f"\n</{section.tag}>\n\n"
return context
Context Caching
Anthropic Prompt Caching
# Anthropic's prompt caching: cache static prefixes
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
system=[
{
"type": "text",
"text": long_system_prompt, # This part will be cached
"cache_control": {"type": "ephemeral"}
}
],
messages=[{"role": "user", "content": user_query}]
)
# On a cache hit, Anthropic reports up to ~85% lower latency and ~90% lower cost for the cached portion
Google Gemini Context Caching
# Gemini context caching
import datetime
import google.generativeai as genai

cached_content = genai.caching.CachedContent.create(
model="gemini-1.5-pro",
contents=[large_document],
ttl=datetime.timedelta(hours=1),
)
model = genai.GenerativeModel.from_cached_content(cached_content)
response = model.generate_content("Analyze this document...")
Attention Optimization Techniques
Sparse Attention
Instead of computing attention between all token pairs, attend only to local windows and a few global anchors (a toy mask construction is sketched after the list below):
- Longformer: Local sliding window + global attention
- BigBird: Random attention + local attention + global attention
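A toy boolean mask illustrates the idea: each position attends to a small local window plus a handful of globally attending positions, rather than to every other token. This is a rough sketch of the pattern, not the actual Longformer or BigBird implementation:

import numpy as np

def local_global_mask(n, window=2, global_positions=(0,)):
    """Boolean attention mask: True means position i may attend to position j."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                 # local sliding window
    for g in global_positions:
        mask[:, g] = True                     # everyone attends to global tokens
        mask[g, :] = True                     # global tokens attend to everyone
    return mask

m = local_global_mask(8)
print(m.sum(), "of", m.size, "pairs computed")   # fewer than the full n*n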
KV Cache Optimization
- PagedAttention (vLLM): Memory management similar to OS paging
- Grouped-Query Attention (GQA): Reduces KV head count to lower cache size
- Multi-Query Attention (MQA): All heads share a single KV set (a rough cache-size comparison is sketched below)
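The effect on cache size is straightforward arithmetic. The sketch below uses illustrative (not official) model dimensions to compare per-context KV-cache memory under MHA, GQA, and MQA:

def kv_cache_bytes(seq_len, layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    """Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * dtype bytes per token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

for name, heads in [("MHA (32 KV heads)", 32), ("GQA (8 KV heads)", 8), ("MQA (1 KV head)", 1)]:
    gb = kv_cache_bytes(seq_len=128_000, kv_heads=heads) / 1e9
    print(f"{name}: ~{gb:.1f} GB for a 128K-token context")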
Practical Recommendations
Context Management Checklist
- [ ] Define a clear token budget allocation scheme
- [ ] Implement automatic conversation history compression
- [ ] Dynamically load tool definitions (only load relevant tools)
- [ ] Place important information at the beginning and end of context
- [ ] Use structured formats to organize context
- [ ] Consider using prompt caching to reduce latency and cost
- [ ] Monitor actual token usage and optimize
Common Pitfalls
- Not monitoring token usage: Leading to unexpected context overflow
- Simple truncation: Crude truncation may lose critical information
- Ignoring Lost in the Middle: Piling all retrieved results in the middle
- Over-compression: Summaries may lose key details
Further Reading
- Chain-of-Thought and Reasoning Patterns - How CoT leverages the context window for reasoning
- Liu, N. F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts"
- Packer, C., et al. (2023). "MemGPT: Towards LLMs as Operating Systems"
- Munkhdalai, T., et al. (2024). "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention"