Working Memory and Context Management
Introduction
The LLM's context window is the agent's working memory. Just as human working memory holds only a limited number of items (Miller's "7 ± 2" finding), the context window is capped at a fixed number of tokens. How efficiently this finite resource is managed directly determines the agent's capability ceiling.
Context Window as Working Memory
Miller's Law and Token Limits
George Miller (1956) found that human short-term memory holds roughly 7 ± 2 chunks. Analogizing to LLMs:
| Model | Context Window | Approximate Words |
|---|---|---|
| GPT-3.5 | 4K tokens | ~3,000 words |
| GPT-4 | 8K/32K tokens | ~6,000/24,000 words |
| GPT-4 Turbo | 128K tokens | ~96,000 words |
| Claude 3.5 | 200K tokens | ~150,000 words |
| Gemini 1.5 Pro | 1M tokens | ~750,000 words |
Context Composition
A typical agent's context contains:
[System Prompt] ~500-2000 tokens
[Tool Definitions] ~200-1000 tokens/tool
[Conversation History] Variable
[Retrieved Context] Variable
[Current User Input] Variable
─────────────────────────
Total must be < Context window limit
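Before sending a request, it is worth verifying that these components actually fit. The sketch below uses tiktoken's cl100k_base encoding as a stand-in tokenizer (an assumption; use whatever tokenizer matches your model) and placeholder contents for each component:

# Minimal sketch: check that all context components fit the window.
# Component contents are placeholders; plug in your real prompt pieces.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

components = {
    "system_prompt": "You are a helpful research agent...",
    "tool_definitions": '{"name": "search", "description": "Web search"}',
    "history": "user: ...\nassistant: ...",
    "retrieved": "[doc 1] ...\n[doc 2] ...",
    "user_input": "Summarize the latest results.",
}

CONTEXT_LIMIT = 128_000       # model's context window
RESERVED_FOR_OUTPUT = 4_000   # leave room for the generated reply

total = sum(count_tokens(text) for text in components.values())
print({k: count_tokens(v) for k, v in components.items()}, "total:", total)
assert total + RESERVED_FOR_OUTPUT <= CONTEXT_LIMIT, f"Context too large: {total} tokens"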
Attention Mechanism Fundamentals
The Transformer's self-attention mechanism is the core of context processing:

\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
\]

where:
- \(Q\) (Query), \(K\) (Key), \(V\) (Value) are linear transformations of the input
- \(d_k\) is the Key dimension; dividing by \(\sqrt{d_k}\) keeps the dot products from growing so large that the softmax saturates and gradients vanish
- softmax produces the attention weight distribution
Key Issue: Self-attention has \(O(n^2)\) computational complexity, where \(n\) is the sequence length, so the compute and memory needed for the attention scores grow quadratically as the context gets longer.
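A toy NumPy implementation makes both the formula and the quadratic cost concrete: the score matrix \(QK^\top\) has shape \((n, n)\), so doubling the sequence length quadruples the work.

# Toy scaled dot-product attention in NumPy (illustrative: single head, no masking).
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) matrix: the O(n^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted sum of values

n, d = 8, 16                                         # sequence length, head dimension
X = np.random.randn(n, d)
Q, K, V = X @ np.random.randn(d, d), X @ np.random.randn(d, d), X @ np.random.randn(d, d)
print(attention(Q, K, V).shape)                      # (8, 16)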
The "Lost in the Middle" Problem
Liu et al. (2023) discovered that LLM utilization of information in the middle of the context is significantly lower than at the beginning and end:
Attention Utilization
^
|███ ███
|████ ████
|█████ █████
|██████ ██████
|████████ ████████
|██████████████████████████████
+─────────────────────────────→ Position
Beginning Middle End
Implication: Place the most important information at the beginning and end of the context.
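A simple way to act on this is to reorder retrieved passages so the strongest ones land at the edges of the prompt and the weakest sit in the middle. The sketch below is a hypothetical helper, assuming the input list is already sorted from most to least relevant:

def order_for_edges(passages):
    """Place the most relevant passages at the beginning and end of the context.

    `passages` is assumed sorted from most to least relevant; items are
    alternated between the front and the back so the weakest material ends
    up in the middle, where attention utilization is lowest.
    """
    front, back = [], []
    for i, p in enumerate(passages):
        (front if i % 2 == 0 else back).append(p)
    return front + back[::-1]

print(order_for_edges(["p1", "p2", "p3", "p4", "p5"]))
# ['p1', 'p3', 'p5', 'p4', 'p2']  -> best passages at the edges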
Context Compression Strategies
1. Conversation History Summarization
Compress overly long conversation histories into summaries:
def compress_history(messages, max_tokens=2000):
"""Compress conversation history into summary"""
if count_tokens(messages) <= max_tokens:
return messages
# Keep system prompt and recent messages
system = messages[0]
recent = messages[-4:] # Keep last 2 conversation turns
# Summarize middle portion
middle = messages[1:-4]
summary = llm.summarize(middle)
return [system, {"role": "system", "content": f"Conversation summary: {summary}"}] + recent
2. Sliding Window
Keep the most recent N conversation turns, discard older content:
def sliding_window(messages, window_size=10):
"""Sliding window strategy"""
system = [m for m in messages if m["role"] == "system"]
conversation = [m for m in messages if m["role"] != "system"]
# Keep most recent window_size turns
recent = conversation[-window_size * 2:]
return system + recent
3. Recursive Summarization
def recursive_summarize(messages, chunk_size=10):
"""Recursive summarization: chunk history, summarize, then re-summarize"""
if len(messages) <= chunk_size:
return summarize(messages)
chunks = [messages[i:i+chunk_size] for i in range(0, len(messages), chunk_size)]
summaries = [summarize(chunk) for chunk in chunks]
return recursive_summarize(summaries, chunk_size)
4. Retrieval-Augmented Context Management
Instead of storing complete history, retrieve relevant segments on demand:
def retrieval_augmented_context(query, history_store, top_k=5):
"""Retrieve relevant history based on current query"""
# Vectorize the query
query_embedding = embed(query)
# Retrieve most relevant segments from history store
relevant = history_store.similarity_search(query_embedding, top_k=top_k)
# Assemble context
context = "\n".join([f"[{r.timestamp}] {r.content}" for r in relevant])
return context
Advanced Context Management Strategies
Token Budget Allocation
pie title Context Token Budget Allocation (128K Window Example)
"System Prompt" : 5
"Tool Definitions" : 10
"Retrieved Context" : 30
"Conversation History" : 35
"Current Input" : 10
"Reserved Generation Space" : 10
Dynamic Context Management
class ContextManager:
def __init__(self, max_tokens=128000):
self.max_tokens = max_tokens
self.budget = {
"system": int(max_tokens * 0.05),
"tools": int(max_tokens * 0.10),
"retrieval": int(max_tokens * 0.30),
"history": int(max_tokens * 0.35),
"input": int(max_tokens * 0.10),
"generation": int(max_tokens * 0.10),
}
def build_context(self, system_prompt, tools, query, history, retrieval_results):
context = []
# 1. System prompt (must be complete)
context.append(truncate(system_prompt, self.budget["system"]))
# 2. Tool definitions (ranked by relevance)
relevant_tools = rank_tools_by_relevance(tools, query)
context.extend(fit_within_budget(relevant_tools, self.budget["tools"]))
# 3. Retrieved results (ranked by relevance)
context.extend(fit_within_budget(retrieval_results, self.budget["retrieval"]))
        # 4. Conversation history (compressed; most recent turns kept verbatim)
compressed_history = self.compress_history(history, self.budget["history"])
context.extend(compressed_history)
# 5. Current input
context.append(query)
return context
def compress_history(self, history, budget):
"""Intelligently compress conversation history"""
if count_tokens(history) <= budget:
return history
# Strategy: keep most recent N turns + summary
recent = history[-6:] # Most recent 3 turns
remaining_budget = budget - count_tokens(recent)
if remaining_budget > 0:
older = history[:-6]
summary = summarize_to_budget(older, remaining_budget)
return [summary] + recent
else:
return recent[-4:] # Degrade to keeping most recent 2 turns
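A hypothetical usage sketch follows. SYSTEM_PROMPT, ALL_TOOL_DEFINITIONS, conversation_history, and vector_store are placeholders, and the helpers used inside ContextManager (truncate, rank_tools_by_relevance, fit_within_budget, summarize_to_budget, count_tokens) are assumed to be provided elsewhere:

# Illustrative only: all inputs below are placeholders.
manager = ContextManager(max_tokens=128_000)
context_parts = manager.build_context(
    system_prompt=SYSTEM_PROMPT,
    tools=ALL_TOOL_DEFINITIONS,
    query="Compare Q3 and Q4 revenue from the attached reports.",
    history=conversation_history,
    retrieval_results=vector_store.search("Q3 Q4 revenue", top_k=8),
)
prompt = "\n\n".join(str(part) for part in context_parts)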
Strategies for Ultra-Long Contexts (128K+)
For models supporting ultra-long contexts:
- Hierarchical organization: Organize context sections with clear separators and headers
- Importance markers: Add explicit markers to key information (e.g., [IMPORTANT])
- Structured format: Use XML/JSON to structure context for easier model parsing
- Redundancy elimination: Remove duplicate information to avoid wasting tokens
def structure_context(sections):
"""Structure context using XML tags"""
context = ""
for section in sections:
context += f"<{section.tag} priority='{section.priority}'>\n"
context += section.content
context += f"\n</{section.tag}>\n\n"
return context
Context Caching
Anthropic Prompt Caching
# Anthropic's prompt caching: cache static prefixes
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
system=[
{
"type": "text",
"text": long_system_prompt, # This part will be cached
"cache_control": {"type": "ephemeral"}
}
],
messages=[{"role": "user", "content": user_query}]
)
# On a cache hit, Anthropic reports up to ~85% lower latency and ~90% lower cost for the cached portion
Google Gemini Context Caching
# Gemini context caching
import datetime
import google.generativeai as genai

cached_content = genai.caching.CachedContent.create(
model="gemini-1.5-pro",
contents=[large_document],
ttl=datetime.timedelta(hours=1),
)
model = genai.GenerativeModel.from_cached_content(cached_content)
response = model.generate_content("Analyze this document...")
Attention Optimization Techniques
Sparse Attention
Instead of computing attention between all token pairs, attend only to local windows and a few global anchors (a toy mask construction is sketched after the list below):
- Longformer: Local sliding window + global attention
- BigBird: Random attention + local attention + global attention
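A toy boolean mask illustrates the idea: each position attends to a small local window plus a handful of globally attending positions, rather than to every other token. This is a rough sketch of the pattern, not the actual Longformer or BigBird implementation:

import numpy as np

def local_global_mask(n, window=2, global_positions=(0,)):
    """Boolean attention mask: True means position i may attend to position j."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                 # local sliding window
    for g in global_positions:
        mask[:, g] = True                     # everyone attends to global tokens
        mask[g, :] = True                     # global tokens attend to everyone
    return mask

m = local_global_mask(8)
print(m.sum(), "of", m.size, "pairs computed")   # fewer than the full n*n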
KV Cache Optimization
- PagedAttention (vLLM): Memory management similar to OS paging
- Grouped-Query Attention (GQA): Reduces KV head count to lower cache size
- Multi-Query Attention (MQA): All heads share a single KV set (a rough cache-size comparison is sketched below)
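The effect on cache size is straightforward arithmetic. The sketch below uses illustrative (not official) model dimensions to compare per-context KV-cache memory under MHA, GQA, and MQA:

def kv_cache_bytes(seq_len, layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    """Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * dtype bytes per token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

for name, heads in [("MHA (32 KV heads)", 32), ("GQA (8 KV heads)", 8), ("MQA (1 KV head)", 1)]:
    gb = kv_cache_bytes(seq_len=128_000, kv_heads=heads) / 1e9
    print(f"{name}: ~{gb:.1f} GB for a 128K-token context")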
Practical Recommendations
Context Management Checklist
- [ ] Define a clear token budget allocation scheme
- [ ] Implement automatic conversation history compression
- [ ] Dynamically load tool definitions (only load relevant tools)
- [ ] Place important information at the beginning and end of context
- [ ] Use structured formats to organize context
- [ ] Consider using prompt caching to reduce latency and cost
- [ ] Monitor actual token usage and optimize
Common Pitfalls
- Not monitoring token usage: Leading to unexpected context overflow
- Simple truncation: Crude truncation may lose critical information
- Ignoring Lost in the Middle: Piling all retrieved results in the middle
- Over-compression: Summaries may lose key details
Further Reading
- Chain-of-Thought and Reasoning Patterns - How CoT leverages the context window for reasoning
- Liu, N. F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts"
- Packer, C., et al. (2023). "MemGPT: Towards LLMs as Operating Systems"
- Munkhdalai, T., et al. (2024). "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention"