Memory Architecture Design
Introduction
Designing a complete agent memory architecture requires jointly considering storage tiers, retrieval efficiency, memory updates, and forgetting mechanisms. This section introduces frontier approaches such as MemGPT, along with practical memory architecture design patterns.
MemGPT: Treating the LLM as an Operating System
Core Idea
Packer et al. (2023) proposed MemGPT, which analogizes the LLM's context window to an operating system's main memory (RAM) and external storage to disk. The LLM itself acts as the "processor," managing memory reads and writes through function calls.
graph TB
    subgraph "MemGPT Architecture"
        subgraph "Main Context (RAM)"
            SP[System Instructions<br/>System Prompt]
            WM[Working Context]
            FIFO[Conversation Buffer<br/>FIFO Queue]
        end
        subgraph "External Storage (Disk)"
            RM[(Recall Storage)]
            AM[(Archival Storage)]
        end
        LLM[LLM Processor] -->|core_memory_append| WM
        LLM -->|core_memory_replace| WM
        LLM -->|conversation_search| RM
        LLM -->|archival_memory_insert| AM
        LLM -->|archival_memory_search| AM
        FIFO -->|Overflow| RM
    end
Key Mechanisms
- Core Memory: Critical information always kept in context (user profile, system state)
- Recall Storage: Vectorized storage of conversation history
- Archival Storage: Vectorized storage of long-term knowledge
- Self-Management: LLM autonomously decides when to read/write memory through function calls
# MemGPT-style memory management functions
MEMGPT_TOOLS = [
    {
        "name": "core_memory_append",
        "description": "Append information to a specified section of core memory",
        "parameters": {
            "section": {"type": "string", "enum": ["human", "persona"]},
            "content": {"type": "string"}
        }
    },
    {
        "name": "core_memory_replace",
        "description": "Replace specific content in core memory",
        "parameters": {
            "section": {"type": "string"},
            "old_content": {"type": "string"},
            "new_content": {"type": "string"}
        }
    },
    {
        "name": "archival_memory_search",
        "description": "Search archival memory",
        "parameters": {
            "query": {"type": "string"},
            "page": {"type": "integer"}
        }
    },
    {
        "name": "archival_memory_insert",
        "description": "Write information to archival memory",
        "parameters": {
            "content": {"type": "string"}
        }
    },
    {
        "name": "conversation_search",
        "description": "Search historical conversations",
        "parameters": {
            "query": {"type": "string"},
            "page": {"type": "integer"}
        }
    }
]
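These definitions only matter if the surrounding agent loop actually executes the calls the LLM emits. The sketch below shows one minimal dispatch loop; the `llm.chat` client, its `tool_calls` response shape, and the `memory` object's methods are all assumed interfaces, not a specific library's API.

# Minimal MemGPT-style dispatch loop (illustrative sketch; interfaces assumed)
def run_agent_turn(llm, memory, messages):
    response = llm.chat(messages=messages, tools=MEMGPT_TOOLS)
    while response.tool_calls:  # the LLM chose to read or write memory
        for call in response.tool_calls:
            # Map the tool name onto a method of the memory backend,
            # e.g. "core_memory_append" -> memory.core_memory_append(...)
            handler = getattr(memory, call.name)
            result = handler(**call.arguments)
            messages.append(
                {"role": "tool", "name": call.name, "content": str(result)})
        # Let the model continue with the tool results in context
        response = llm.chat(messages=messages, tools=MEMGPT_TOOLS)
    return response.content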
Tiered Memory Architecture
Hot/Warm/Cold Tiers
graph TB
    subgraph "Hot Tier"
        H1[Information in context window]
        H2[Core memory / User profile]
        H3[Current task state]
    end
    subgraph "Warm Tier"
        W1[Recent N turns of conversation history]
        W2[Frequently accessed knowledge]
        W3[Documents related to active tasks]
    end
    subgraph "Cold Tier"
        C1[Historical conversation archives]
        C2[Infrequently used knowledge documents]
        C3[Records of completed tasks]
    end
    H1 -.->|Overflow/Aging| W1
    W1 -.->|Archive| C1
    C1 -.->|On-demand retrieval| W1
    W1 -.->|Inject into context| H1
Tiering Strategy
| Tier | Storage Location | Access Latency | Capacity | Content |
|---|---|---|---|---|
| Hot | LLM context | ~0 ms (already in the prompt) | A few thousand tokens | System prompt, core memory, current conversation |
| Warm | In-memory cache + vector DB | 10-100 ms | Several MB | Recent history, high-frequency knowledge |
| Cold | Persistent storage | 100 ms - 1 s | Several GB | Full history, complete knowledge base |
class TieredMemory:
    def __init__(self, hot_budget=4000, warm_size=1000):
        # HotMemory, WarmMemory, and ColdMemory are assumed backend classes:
        # an in-context buffer, an in-memory vector cache, and a persistent store.
        self.hot = HotMemory(max_tokens=hot_budget)
        self.warm = WarmMemory(max_items=warm_size)  # In-memory vector cache
        self.cold = ColdMemory()  # Persistent vector database

    def retrieve(self, query, max_results=5):
        """Tiered retrieval: hot first, then warm, then cold."""
        # 1. Hot tier: directly in context
        hot_results = self.hot.search(query)
        if len(hot_results) >= max_results:
            return hot_results[:max_results]
        # 2. Warm tier: in-memory vector search
        warm_results = self.warm.search(query, k=max_results - len(hot_results))
        combined = hot_results + warm_results
        if len(combined) >= max_results:
            return combined[:max_results]
        # 3. Cold tier: persistent storage search
        cold_results = self.cold.search(query, k=max_results - len(combined))
        # Promote cold-tier hits to the warm tier
        for result in cold_results:
            self.warm.promote(result)
        return (combined + cold_results)[:max_results]

    def store(self, content, importance=5):
        """Decide storage tier based on importance."""
        if importance >= 8:
            self.hot.add(content)   # High importance: directly into context
            self.warm.add(content)  # Also back up in the warm tier
        elif importance >= 4:
            self.warm.add(content)  # Medium importance: warm tier
        else:
            self.cold.add(content)  # Low importance: cold tier
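A brief usage sketch, assuming the three backend classes above are implemented:

# Usage sketch: writes are routed by importance; reads fall through the tiers.
memory = TieredMemory(hot_budget=4000, warm_size=1000)
memory.store("User's name is Alice; prefers concise answers", importance=9)  # hot + warm
memory.store("Discussed vector DB sharding yesterday", importance=5)         # warm
memory.store("Small talk about the weather", importance=2)                   # cold
results = memory.retrieve("what does the user prefer?", max_results=3)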
Memory Consolidation
Inspiration: Human Memory Consolidation
Humans perform memory consolidation during sleep: converting short-term memories into long-term ones, integrating old and new knowledge, and extracting patterns and regularities.
Agent Memory Consolidation
import json

class MemoryConsolidation:
    def __init__(self, episodic_memory, semantic_memory, llm):
        self.episodic = episodic_memory
        self.semantic = semantic_memory
        self.llm = llm

    def consolidate(self):
        """Periodically perform memory consolidation."""
        # 1. Extract recent experiences
        recent_episodes = self.episodic.get_recent(hours=24)
        # 2. Extract knowledge from experiences as structured triples
        raw = self.llm.generate(
            f"Extract general knowledge and patterns from the following experiences:\n"
            f"{recent_episodes}\n"
            f'Output format: a JSON list of objects with keys "subject", "relation", "object"')
        knowledge = json.loads(raw)  # parse the LLM output before iterating
        # 3. Store in semantic memory
        for triple in knowledge:
            self.semantic.add_knowledge(
                triple["subject"], triple["relation"], triple["object"])
        # 4. Generate reflections
        reflection = self.llm.generate(
            f"Based on recent experiences, what important observations or lessons "
            f"are there?\n{recent_episodes}")
        # 5. Store reflections as high-importance episodes
        self.episodic.store_episode({
            "type": "reflection",
            "content": reflection,
            "importance": 9,
        })
        return {"knowledge_extracted": len(knowledge), "reflection": reflection}
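Consolidation should run off the hot path, e.g. on a timer or at session end. Below is a minimal standard-library scheduler sketch, where `consolidator` is a MemoryConsolidation instance from above:

import threading

def schedule_consolidation(consolidator, interval_hours=24):
    """Run consolidation periodically on a daemon timer thread."""
    def run():
        consolidator.consolidate()
        schedule_consolidation(consolidator, interval_hours)  # reschedule next run
    timer = threading.Timer(interval_hours * 3600, run)
    timer.daemon = True  # don't keep the process alive just for consolidation
    timer.start()
    return timer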
Importance Assessment
Importance Scoring
Not all information is equally important. Importance scores determine memory storage tier and retention time.
def score_importance(content, llm):
    """Use an LLM to assess information importance (1-10)."""
    score = llm.generate(
        f"On a scale of 1 to 10, assess the long-term importance of the "
        f"following information to an AI assistant.\n"
        f"1=completely trivial (small talk), 10=extremely important "
        f"(key preferences/knowledge).\n"
        f"Information: {content}\n"
        f"Output only the number:")
    try:
        return max(1, min(10, int(score.strip())))  # clamp to the 1-10 range
    except ValueError:
        return 5  # fall back to a neutral score on malformed LLM output
Rule-Based Importance Assessment
def rule_based_importance(content, metadata):
    """Quick rule-based importance assessment."""
    score = 5  # Base score
    # User explicitly asked to remember
    if "remember" in content.lower() or "don't forget" in content.lower():
        score += 3
    # Contains personal preferences
    if any(kw in content.lower() for kw in ["like", "dislike", "prefer", "hate"]):
        score += 2
    # Task success/failure experience
    if metadata.get("outcome") == "failure":
        score += 2  # Failure experiences are more worth remembering
    # Contains specific numbers or facts
    if any(char.isdigit() for char in content):
        score += 1
    return min(score, 10)
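The two scorers compose naturally: rules are cheap, so one plausible hybrid (the thresholds here are illustrative, not from the original) applies them first and only pays for an LLM call in the ambiguous middle band:

def hybrid_importance(content, metadata, llm):
    """Cheap rules first; fall back to the LLM only for borderline scores."""
    score = rule_based_importance(content, metadata)
    if 4 <= score <= 7:  # ambiguous band: worth an LLM call
        score = score_importance(content, llm)
    return score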
Forgetting Curve and Memory Decay
Ebbinghaus Forgetting Curve
Memory retention decays exponentially over time:

\[
R = e^{-t/S}
\]

where:
- \(R\) is memory retention
- \(t\) is elapsed time
- \(S\) is memory strength, related to number of reviews and importance
Implementation
import math
from datetime import datetime
class MemoryWithDecay:
def __init__(self):
self.memories = []
def compute_retention(self, memory):
"""Compute memory retention"""
hours_since_access = (
datetime.now() - memory["last_accessed"]
).total_seconds() / 3600
# Memory strength S related to access count and importance
strength = memory["access_count"] * 0.5 + memory["importance"] * 2
retention = math.exp(-hours_since_access / max(strength, 1))
return retention
def decay_and_cleanup(self, threshold=0.1):
"""Decay and clean up memories with retention below threshold"""
to_keep = []
to_archive = []
for mem in self.memories:
retention = self.compute_retention(mem)
mem["retention"] = retention
if retention >= threshold:
to_keep.append(mem)
else:
to_archive.append(mem)
self.memories = to_keep
return to_archive # Return archived memories
def access(self, memory_id):
"""Update metadata when accessing a memory (spaced repetition effect)"""
for mem in self.memories:
if mem["id"] == memory_id:
mem["access_count"] += 1
mem["last_accessed"] = datetime.now()
break
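To make the strength formula concrete, a quick worked example with the default cleanup threshold of 0.1:

# Worked example: retention after 24 hours for two memories.
import math
s_strong = 3 * 0.5 + 8 * 2   # accessed 3 times, importance 8 -> S = 17.5
s_weak = 0 * 0.5 + 2 * 2     # never re-accessed, importance 2 -> S = 4.0
print(math.exp(-24 / s_strong))  # ~0.25: retained (above the 0.1 threshold)
print(math.exp(-24 / s_weak))    # ~0.0025: archived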
Practical Design Patterns
Pattern 1: Progressive User Profile Construction
import json

class UserProfile:
    """Progressively build a user profile through conversation."""
    def __init__(self, llm):
        self.profile = {
            "preferences": {},
            "background": {},
            "goals": {},
            "communication_style": {},
        }
        self.llm = llm

    def update_from_conversation(self, messages):
        """Extract user information from conversation."""
        extracted = self.llm.generate(
            f"Extract user preferences, background, goals, and other "
            f"information from the following conversation.\n"
            f"Conversation: {messages}\n"
            f"Current profile: {self.profile}\n"
            f"Output updated profile (JSON):")
        # Parse the LLM's JSON output before merging (merge_profiles: see below)
        self.profile = merge_profiles(self.profile, json.loads(extracted))
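The `merge_profiles` helper is left undefined above; one reasonable sketch is a recursive dictionary merge in which newly extracted values override old ones:

def merge_profiles(current, updates):
    """Recursively merge extracted fields; newer values win on conflicts."""
    merged = dict(current)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_profiles(merged[key], value)  # merge nested sections
        else:
            merged[key] = value
    return merged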
Pattern 2: Task Context Persistence
from datetime import datetime

class TaskContext:
    """Cross-session task context."""
    def __init__(self, store):
        self.store = store

    def save_checkpoint(self, task_id, state):
        """Save task checkpoint."""
        self.store.save({
            "task_id": task_id,
            "state": state,
            "progress": state.get("progress", 0),
            "next_steps": state.get("next_steps", []),
            "timestamp": datetime.now().isoformat(),
        })

    def resume_task(self, task_id):
        """Resume task context."""
        checkpoint = self.store.get_latest(task_id)
        if checkpoint:
            return (
                f"Task progress: {checkpoint['progress']}%\n"
                f"Next steps: {checkpoint['next_steps']}\n"
                f"Last updated: {checkpoint['timestamp']}")
        return None
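A usage sketch, assuming a store backend exposing the `save`/`get_latest` methods used above:

# Usage sketch: checkpoint at the end of a session, resume in the next one.
ctx = TaskContext(store=my_store)  # my_store: any backend with save/get_latest
ctx.save_checkpoint("report-42", {
    "progress": 60,
    "next_steps": ["draft conclusion", "incorporate reviewer feedback"],
})
resume_text = ctx.resume_task("report-42")  # inject into the next session's prompt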
Pattern 3: Memory Indexing and Tags
from datetime import datetime

class TaggedMemory:
    """Tagged memory system supporting multi-dimensional retrieval."""
    def __init__(self, vector_store):
        self.vector_store = vector_store

    def store(self, content, tags, importance=5):
        self.vector_store.add(
            documents=[content],
            metadatas=[{
                "tags": ",".join(tags),
                "importance": importance,
                "timestamp": datetime.now().isoformat(),
            }],
            ids=[generate_id()]  # generate_id: any unique-ID helper, e.g. uuid4
        )

    def search_by_tag(self, tag, query=None, n=10):
        """Search by tag, optionally combined with semantic search."""
        # Note: metadata filter operators vary by vector store; "$contains"
        # is illustrative and may need adapting to your backend's syntax.
        filters = {"tags": {"$contains": tag}}
        if query:
            return self.vector_store.query(
                query_texts=[query], n_results=n, where=filters)
        else:
            return self.vector_store.get(where=filters, limit=n)
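Usage follows the same write-then-filter pattern (the store is assumed to expose a Chroma-style API, as the method names above suggest):

# Usage sketch: tag memories at write time, filter by tag at read time.
mem = TaggedMemory(vector_store=my_vector_store)
mem.store("User prefers dark mode", tags=["preference", "ui"], importance=7)
mem.store("Deployed v2.1 to staging", tags=["task", "deployment"], importance=5)
hits = mem.search_by_tag("preference", query="theme settings", n=5)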
Comprehensive Architecture Example
graph TB
    subgraph "Agent Comprehensive Memory Architecture"
        INPUT[Input] --> ROUTER{Memory Router}
        ROUTER -->|Store| IMP[Importance Assessment]
        IMP -->|High| HOT[Hot Tier<br/>Core Memory]
        IMP -->|Medium| WARM[Warm Tier<br/>Vector Cache]
        IMP -->|Low| COLD[Cold Tier<br/>Archival Storage]
        ROUTER -->|Retrieve| QP[Query Processing]
        QP --> SEARCH[Tiered Search]
        HOT --> SEARCH
        WARM --> SEARCH
        COLD --> SEARCH
        SEARCH --> RERANK[Reranking]
        RERANK --> CTX[Context Injection]
        subgraph "Background Tasks"
            CONS[Memory Consolidation<br/>Periodic]
            DECAY[Decay Cleanup<br/>Periodic]
            CONS --> WARM
            DECAY --> COLD
        end
    end
    CTX --> LLM[LLM Reasoning]
    LLM --> OUTPUT[Output]
    OUTPUT -->|Feedback| ROUTER
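The diagram's boxes map onto the classes introduced earlier. Below is a minimal router sketch wiring them together; the reranking step here is a simple importance sort standing in for a real reranker, and retrieval results are assumed to be dicts carrying an `importance` field:

class MemoryRouter:
    """Routes stores through importance scoring and reads through tiered search."""
    def __init__(self, tiered_memory, llm):
        self.memory = tiered_memory  # a TieredMemory instance from earlier
        self.llm = llm

    def handle_store(self, content, metadata=None):
        importance = hybrid_importance(content, metadata or {}, self.llm)
        self.memory.store(content, importance=importance)

    def handle_retrieve(self, query, max_results=5):
        results = self.memory.retrieve(query, max_results=max_results)
        # Rerank stage: sort by stored importance; a cross-encoder could replace this
        return sorted(results, key=lambda r: r.get("importance", 0), reverse=True)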
Practical Recommendations
Design Checklist
- [ ] Define clear tiering strategy (hot/warm/cold)
- [ ] Implement importance assessment mechanism
- [ ] Design forgetting/decay strategy to prevent memory bloat
- [ ] Implement memory consolidation pipeline (episodic to semantic)
- [ ] Add memory conflict detection and resolution mechanism (a sketch follows this list)
- [ ] Monitor memory system retrieval quality and latency
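For the conflict-detection item above, one lightweight approach is to check each new memory against its nearest semantic neighbors before writing; the sketch below assumes a Chroma-style `query` result shape and leaves the resolution policy to the caller:

def detect_conflicts(new_content, vector_store, llm, k=3):
    """Flag stored memories that contradict a new one before writing it."""
    neighbors = vector_store.query(query_texts=[new_content], n_results=k)
    conflicts = []
    for doc in neighbors["documents"][0]:
        verdict = llm.generate(
            f"Do these two statements contradict each other? Answer yes or no.\n"
            f"A: {new_content}\nB: {doc}")
        if verdict.strip().lower().startswith("yes"):
            conflicts.append(doc)
    # Caller resolves: overwrite, keep both with timestamps, or ask the user
    return conflicts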
Considerations
- Privacy: User memory data needs proper protection
- Consistency: Avoid storing mutually contradictory memories
- Interpretability: Users should be able to view and manage the agent's "memories" about them
- Cost: Storage and query costs of vector databases need to be controlled
References
- Packer, C., et al. (2023). "MemGPT: Towards LLMs as Operating Systems"
- Park, J. S., et al. (2023). "Generative Agents: Interactive Simulacra of Human Behavior"
- Ebbinghaus, H. (1885). "Über das Gedächtnis"
- Zhang, Z., et al. (2024). "A Survey on the Memory Mechanism of Large Language Model based Agents"
- Modarressi, A., et al. (2024). "MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory"