Memory Architecture Design
Introduction
Designing a complete agent memory architecture requires jointly considering storage tiers, retrieval efficiency, memory updates, and forgetting mechanisms. This section introduces frontier approaches such as MemGPT, along with practical memory architecture design patterns.
MemGPT: Treating the LLM as an Operating System
Core Idea
Packer et al. (2023) proposed MemGPT, which analogizes the LLM's context window to an operating system's main memory (RAM) and external storage to disk. The LLM itself acts as the "processor," managing memory reads and writes through function calls.
graph TB
    subgraph "MemGPT Architecture"
        subgraph "Main Context (RAM)"
            SP[System Instructions<br/>System Prompt]
            WM[Working Context]
            FIFO[Conversation Buffer<br/>FIFO Queue]
        end
        subgraph "External Storage (Disk)"
            RM[(Recall Storage)]
            AM[(Archival Storage)]
        end
        LLM[LLM Processor] -->|core_memory_append| WM
        LLM -->|core_memory_replace| WM
        LLM -->|conversation_search| RM
        LLM -->|archival_memory_insert| AM
        LLM -->|archival_memory_search| AM
        FIFO -->|Overflow| RM
    end
Key Mechanisms
- Core Memory: Critical information always kept in context (user profile, system state)
- Recall Storage: Vectorized storage of conversation history
- Archival Storage: Vectorized storage of long-term knowledge
- Self-Management: LLM autonomously decides when to read/write memory through function calls
# MemGPT-style memory management functions
MEMGPT_TOOLS = [
    {
        "name": "core_memory_append",
        "description": "Append information to a specified section of core memory",
        "parameters": {
            "section": {"type": "string", "enum": ["human", "persona"]},
            "content": {"type": "string"}
        }
    },
    {
        "name": "core_memory_replace",
        "description": "Replace specific content in core memory",
        "parameters": {
            "section": {"type": "string"},
            "old_content": {"type": "string"},
            "new_content": {"type": "string"}
        }
    },
    {
        "name": "archival_memory_search",
        "description": "Search archival memory",
        "parameters": {
            "query": {"type": "string"},
            "page": {"type": "integer"}
        }
    },
    {
        "name": "archival_memory_insert",
        "description": "Write information to archival memory",
        "parameters": {
            "content": {"type": "string"}
        }
    },
    {
        "name": "conversation_search",
        "description": "Search historical conversations",
        "parameters": {
            "query": {"type": "string"},
            "page": {"type": "integer"}
        }
    }
]
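These definitions only matter if the surrounding agent loop actually executes the calls the LLM emits. The sketch below shows one minimal dispatch loop; the `llm.chat` client, its `tool_calls` response shape, and the `memory` object's methods are all assumed interfaces, not a specific library's API.

# Minimal MemGPT-style dispatch loop (illustrative sketch; interfaces assumed)
def run_agent_turn(llm, memory, messages):
    response = llm.chat(messages=messages, tools=MEMGPT_TOOLS)
    while response.tool_calls:  # the LLM chose to read or write memory
        for call in response.tool_calls:
            # Map the tool name onto a method of the memory backend,
            # e.g. "core_memory_append" -> memory.core_memory_append(...)
            handler = getattr(memory, call.name)
            result = handler(**call.arguments)
            messages.append(
                {"role": "tool", "name": call.name, "content": str(result)})
        # Let the model continue with the tool results in context
        response = llm.chat(messages=messages, tools=MEMGPT_TOOLS)
    return response.content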
Tiered Memory Architecture
Hot/Warm/Cold Tiers
graph TB
    subgraph "Hot Tier"
        H1[Information in context window]
        H2[Core memory / User profile]
        H3[Current task state]
    end
    subgraph "Warm Tier"
        W1[Recent N turns of conversation history]
        W2[Frequently accessed knowledge]
        W3[Documents related to active tasks]
    end
    subgraph "Cold Tier"
        C1[Historical conversation archives]
        C2[Infrequently used knowledge documents]
        C3[Records of completed tasks]
    end
    H1 -.->|Overflow/Aging| W1
    W1 -.->|Archive| C1
    C1 -.->|On-demand retrieval| W1
    W1 -.->|Inject into context| H1
Tiering Strategy
| Tier | Storage Location | Access Latency | Capacity | Content |
|---|---|---|---|---|
| Hot | LLM context | ~0 ms (already in the prompt) | A few thousand tokens | System prompt, core memory, current conversation |
| Warm | In-memory cache + vector DB | 10-100 ms | Several MB | Recent history, high-frequency knowledge |
| Cold | Persistent storage | 100 ms - 1 s | Several GB | Full history, complete knowledge base |
class TieredMemory:
    def __init__(self, hot_budget=4000, warm_size=1000):
        # HotMemory, WarmMemory, and ColdMemory are assumed backend classes:
        # an in-context buffer, an in-memory vector cache, and a persistent store.
        self.hot = HotMemory(max_tokens=hot_budget)
        self.warm = WarmMemory(max_items=warm_size)  # In-memory vector cache
        self.cold = ColdMemory()  # Persistent vector database

    def retrieve(self, query, max_results=5):
        """Tiered retrieval: hot first, then warm, then cold."""
        # 1. Hot tier: directly in context
        hot_results = self.hot.search(query)
        if len(hot_results) >= max_results:
            return hot_results[:max_results]
        # 2. Warm tier: in-memory vector search
        warm_results = self.warm.search(query, k=max_results - len(hot_results))
        combined = hot_results + warm_results
        if len(combined) >= max_results:
            return combined[:max_results]
        # 3. Cold tier: persistent storage search
        cold_results = self.cold.search(query, k=max_results - len(combined))
        # Promote cold-tier hits to the warm tier
        for result in cold_results:
            self.warm.promote(result)
        return (combined + cold_results)[:max_results]

    def store(self, content, importance=5):
        """Decide storage tier based on importance."""
        if importance >= 8:
            self.hot.add(content)   # High importance: directly into context
            self.warm.add(content)  # Also back up in the warm tier
        elif importance >= 4:
            self.warm.add(content)  # Medium importance: warm tier
        else:
            self.cold.add(content)  # Low importance: cold tier
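A brief usage sketch, assuming the three backend classes above are implemented:

# Usage sketch: writes are routed by importance; reads fall through the tiers.
memory = TieredMemory(hot_budget=4000, warm_size=1000)
memory.store("User's name is Alice; prefers concise answers", importance=9)  # hot + warm
memory.store("Discussed vector DB sharding yesterday", importance=5)         # warm
memory.store("Small talk about the weather", importance=2)                   # cold
results = memory.retrieve("what does the user prefer?", max_results=3)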
Memory Consolidation
Inspiration: Human Memory Consolidation
Humans perform memory consolidation during sleep: converting short-term memories into long-term ones, integrating old and new knowledge, and extracting patterns and regularities.
Agent Memory Consolidation
import json

class MemoryConsolidation:
    def __init__(self, episodic_memory, semantic_memory, llm):
        self.episodic = episodic_memory
        self.semantic = semantic_memory
        self.llm = llm

    def consolidate(self):
        """Periodically perform memory consolidation."""
        # 1. Extract recent experiences
        recent_episodes = self.episodic.get_recent(hours=24)
        # 2. Extract knowledge from experiences as structured triples
        raw = self.llm.generate(
            f"Extract general knowledge and patterns from the following experiences:\n"
            f"{recent_episodes}\n"
            f'Output format: a JSON list of objects with keys "subject", "relation", "object"')
        knowledge = json.loads(raw)  # parse the LLM output before iterating
        # 3. Store in semantic memory
        for triple in knowledge:
            self.semantic.add_knowledge(
                triple["subject"], triple["relation"], triple["object"])
        # 4. Generate reflections
        reflection = self.llm.generate(
            f"Based on recent experiences, what important observations or lessons "
            f"are there?\n{recent_episodes}")
        # 5. Store reflections as high-importance episodes
        self.episodic.store_episode({
            "type": "reflection",
            "content": reflection,
            "importance": 9,
        })
        return {"knowledge_extracted": len(knowledge), "reflection": reflection}
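Consolidation should run off the hot path, e.g. on a timer or at session end. Below is a minimal standard-library scheduler sketch, where `consolidator` is a MemoryConsolidation instance from above:

import threading

def schedule_consolidation(consolidator, interval_hours=24):
    """Run consolidation periodically on a daemon timer thread."""
    def run():
        consolidator.consolidate()
        schedule_consolidation(consolidator, interval_hours)  # reschedule next run
    timer = threading.Timer(interval_hours * 3600, run)
    timer.daemon = True  # don't keep the process alive just for consolidation
    timer.start()
    return timer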
Importance Assessment
Importance Scoring
Not all information is equally important. Importance scores determine memory storage tier and retention time.
def score_importance(content, llm):
    """Use an LLM to assess information importance (1-10)."""
    score = llm.generate(
        f"On a scale of 1 to 10, assess the long-term importance of the "
        f"following information to an AI assistant.\n"
        f"1=completely trivial (small talk), 10=extremely important "
        f"(key preferences/knowledge).\n"
        f"Information: {content}\n"
        f"Output only the number:")
    try:
        return max(1, min(10, int(score.strip())))  # clamp to the 1-10 range
    except ValueError:
        return 5  # fall back to a neutral score on malformed LLM output
Rule-Based Importance Assessment
def rule_based_importance(content, metadata):
    """Quick rule-based importance assessment."""
    score = 5  # Base score
    # User explicitly asked to remember
    if "remember" in content.lower() or "don't forget" in content.lower():
        score += 3
    # Contains personal preferences
    if any(kw in content.lower() for kw in ["like", "dislike", "prefer", "hate"]):
        score += 2
    # Task success/failure experience
    if metadata.get("outcome") == "failure":
        score += 2  # Failure experiences are more worth remembering
    # Contains specific numbers or facts
    if any(char.isdigit() for char in content):
        score += 1
    return min(score, 10)
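The two scorers compose naturally: rules are cheap, so one plausible hybrid (the thresholds here are illustrative, not from the original) applies them first and only pays for an LLM call in the ambiguous middle band:

def hybrid_importance(content, metadata, llm):
    """Cheap rules first; fall back to the LLM only for borderline scores."""
    score = rule_based_importance(content, metadata)
    if 4 <= score <= 7:  # ambiguous band: worth an LLM call
        score = score_importance(content, llm)
    return score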
Forgetting Curve and Memory Decay
Ebbinghaus Forgetting Curve
Memory retention decays exponentially over time:

\[
R = e^{-t/S}
\]

where:
- \(R\) is memory retention
- \(t\) is elapsed time
- \(S\) is memory strength, related to number of reviews and importance
Implementation
import math
from datetime import datetime
class MemoryWithDecay:
def __init__(self):
self.memories = []
def compute_retention(self, memory):
"""Compute memory retention"""
hours_since_access = (
datetime.now() - memory["last_accessed"]
).total_seconds() / 3600
# Memory strength S related to access count and importance
strength = memory["access_count"] * 0.5 + memory["importance"] * 2
retention = math.exp(-hours_since_access / max(strength, 1))
return retention
def decay_and_cleanup(self, threshold=0.1):
"""Decay and clean up memories with retention below threshold"""
to_keep = []
to_archive = []
for mem in self.memories:
retention = self.compute_retention(mem)
mem["retention"] = retention
if retention >= threshold:
to_keep.append(mem)
else:
to_archive.append(mem)
self.memories = to_keep
return to_archive # Return archived memories
def access(self, memory_id):
"""Update metadata when accessing a memory (spaced repetition effect)"""
for mem in self.memories:
if mem["id"] == memory_id:
mem["access_count"] += 1
mem["last_accessed"] = datetime.now()
break
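To make the strength formula concrete, a quick worked example with the default cleanup threshold of 0.1:

# Worked example: retention after 24 hours for two memories.
import math
s_strong = 3 * 0.5 + 8 * 2   # accessed 3 times, importance 8 -> S = 17.5
s_weak = 0 * 0.5 + 2 * 2     # never re-accessed, importance 2 -> S = 4.0
print(math.exp(-24 / s_strong))  # ~0.25: retained (above the 0.1 threshold)
print(math.exp(-24 / s_weak))    # ~0.0025: archived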
Practical Design Patterns
Pattern 1: Progressive User Profile Construction
import json

class UserProfile:
    """Progressively build a user profile through conversation."""
    def __init__(self, llm):
        self.profile = {
            "preferences": {},
            "background": {},
            "goals": {},
            "communication_style": {},
        }
        self.llm = llm

    def update_from_conversation(self, messages):
        """Extract user information from conversation."""
        extracted = self.llm.generate(
            f"Extract user preferences, background, goals, and other "
            f"information from the following conversation.\n"
            f"Conversation: {messages}\n"
            f"Current profile: {self.profile}\n"
            f"Output updated profile (JSON):")
        # Parse the LLM's JSON output before merging (merge_profiles: see below)
        self.profile = merge_profiles(self.profile, json.loads(extracted))
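The `merge_profiles` helper is left undefined above; one reasonable sketch is a recursive dictionary merge in which newly extracted values override old ones:

def merge_profiles(current, updates):
    """Recursively merge extracted fields; newer values win on conflicts."""
    merged = dict(current)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_profiles(merged[key], value)  # merge nested sections
        else:
            merged[key] = value
    return merged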
Pattern 2: Task Context Persistence
from datetime import datetime

class TaskContext:
    """Cross-session task context."""
    def __init__(self, store):
        self.store = store

    def save_checkpoint(self, task_id, state):
        """Save task checkpoint."""
        self.store.save({
            "task_id": task_id,
            "state": state,
            "progress": state.get("progress", 0),
            "next_steps": state.get("next_steps", []),
            "timestamp": datetime.now().isoformat(),
        })

    def resume_task(self, task_id):
        """Resume task context."""
        checkpoint = self.store.get_latest(task_id)
        if checkpoint:
            return (
                f"Task progress: {checkpoint['progress']}%\n"
                f"Next steps: {checkpoint['next_steps']}\n"
                f"Last updated: {checkpoint['timestamp']}")
        return None
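A usage sketch, assuming a store backend exposing the `save`/`get_latest` methods used above:

# Usage sketch: checkpoint at the end of a session, resume in the next one.
ctx = TaskContext(store=my_store)  # my_store: any backend with save/get_latest
ctx.save_checkpoint("report-42", {
    "progress": 60,
    "next_steps": ["draft conclusion", "incorporate reviewer feedback"],
})
resume_text = ctx.resume_task("report-42")  # inject into the next session's prompt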
Pattern 3: Memory Indexing and Tags
from datetime import datetime

class TaggedMemory:
    """Tagged memory system supporting multi-dimensional retrieval."""
    def __init__(self, vector_store):
        self.vector_store = vector_store

    def store(self, content, tags, importance=5):
        self.vector_store.add(
            documents=[content],
            metadatas=[{
                "tags": ",".join(tags),
                "importance": importance,
                "timestamp": datetime.now().isoformat(),
            }],
            ids=[generate_id()]  # generate_id: any unique-ID helper, e.g. uuid4
        )

    def search_by_tag(self, tag, query=None, n=10):
        """Search by tag, optionally combined with semantic search."""
        # Note: metadata filter operators vary by vector store; "$contains"
        # is illustrative and may need adapting to your backend's syntax.
        filters = {"tags": {"$contains": tag}}
        if query:
            return self.vector_store.query(
                query_texts=[query], n_results=n, where=filters)
        else:
            return self.vector_store.get(where=filters, limit=n)
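Usage follows the same write-then-filter pattern (the store is assumed to expose a Chroma-style API, as the method names above suggest):

# Usage sketch: tag memories at write time, filter by tag at read time.
mem = TaggedMemory(vector_store=my_vector_store)
mem.store("User prefers dark mode", tags=["preference", "ui"], importance=7)
mem.store("Deployed v2.1 to staging", tags=["task", "deployment"], importance=5)
hits = mem.search_by_tag("preference", query="theme settings", n=5)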
Comprehensive Architecture Example
graph TB
    subgraph "Agent Comprehensive Memory Architecture"
        INPUT[Input] --> ROUTER{Memory Router}
        ROUTER -->|Store| IMP[Importance Assessment]
        IMP -->|High| HOT[Hot Tier<br/>Core Memory]
        IMP -->|Medium| WARM[Warm Tier<br/>Vector Cache]
        IMP -->|Low| COLD[Cold Tier<br/>Archival Storage]
        ROUTER -->|Retrieve| QP[Query Processing]
        QP --> SEARCH[Tiered Search]
        HOT --> SEARCH
        WARM --> SEARCH
        COLD --> SEARCH
        SEARCH --> RERANK[Reranking]
        RERANK --> CTX[Context Injection]
        subgraph "Background Tasks"
            CONS[Memory Consolidation<br/>Periodic]
            DECAY[Decay Cleanup<br/>Periodic]
            CONS --> WARM
            DECAY --> COLD
        end
    end
    CTX --> LLM[LLM Reasoning]
    LLM --> OUTPUT[Output]
    OUTPUT -->|Feedback| ROUTER
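The diagram's boxes map onto the classes introduced earlier. Below is a minimal router sketch wiring them together; the reranking step here is a simple importance sort standing in for a real reranker, and retrieval results are assumed to be dicts carrying an `importance` field:

class MemoryRouter:
    """Routes stores through importance scoring and reads through tiered search."""
    def __init__(self, tiered_memory, llm):
        self.memory = tiered_memory  # a TieredMemory instance from earlier
        self.llm = llm

    def handle_store(self, content, metadata=None):
        importance = hybrid_importance(content, metadata or {}, self.llm)
        self.memory.store(content, importance=importance)

    def handle_retrieve(self, query, max_results=5):
        results = self.memory.retrieve(query, max_results=max_results)
        # Rerank stage: sort by stored importance; a cross-encoder could replace this
        return sorted(results, key=lambda r: r.get("importance", 0), reverse=True)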
Practical Recommendations
Design Checklist
- [ ] Define clear tiering strategy (hot/warm/cold)
- [ ] Implement importance assessment mechanism
- [ ] Design forgetting/decay strategy to prevent memory bloat
- [ ] Implement memory consolidation pipeline (episodic to semantic)
- [ ] Add memory conflict detection and resolution mechanism (a sketch follows this list)
- [ ] Monitor memory system retrieval quality and latency
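For the conflict-detection item above, one lightweight approach is to check each new memory against its nearest semantic neighbors before writing; the sketch below assumes a Chroma-style `query` result shape and leaves the resolution policy to the caller:

def detect_conflicts(new_content, vector_store, llm, k=3):
    """Flag stored memories that contradict a new one before writing it."""
    neighbors = vector_store.query(query_texts=[new_content], n_results=k)
    conflicts = []
    for doc in neighbors["documents"][0]:
        verdict = llm.generate(
            f"Do these two statements contradict each other? Answer yes or no.\n"
            f"A: {new_content}\nB: {doc}")
        if verdict.strip().lower().startswith("yes"):
            conflicts.append(doc)
    # Caller resolves: overwrite, keep both with timestamps, or ask the user
    return conflicts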
Considerations
- Privacy: User memory data needs proper protection
- Consistency: Avoid storing mutually contradictory memories
- Interpretability: Users should be able to view and manage the agent's "memories" about them
- Cost: Storage and query costs of vector databases need to be controlled
References
- Packer, C., et al. (2023). "MemGPT: Towards LLMs as Operating Systems"
- Park, J. S., et al. (2023). "Generative Agents: Interactive Simulacra of Human Behavior"
- Ebbinghaus, H. (1885). "Über das Gedächtnis"
- Zhang, Z., et al. (2024). "A Survey on the Memory Mechanism of Large Language Model based Agents"
- Modarressi, A., et al. (2024). "MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory"