# Memory Streams and Reflection Mechanisms
## Overview

The memory stream is the core memory architecture proposed by Park et al. (2023) in *Generative Agents*. It provides virtual embodied agents with a complete record of their experiences and achieves efficient memory use through tri-factor retrieval scoring and a reflection mechanism.
## Related Content
For foundational theories on episodic and semantic memory, see Episodic and Semantic Memory.
## Memory Stream Architecture

The memory stream is a temporally ordered list of memory objects, each containing:

```python
class MemoryObject:
    """A single memory entry in the memory stream"""
    def __init__(self):
        self.id: int = 0                  # Unique memory ID
        self.description: str = ""        # Natural language description
        self.creation_time: float = 0.0   # Creation timestamp
        self.last_access: float = 0.0     # Last access time
        self.importance: int = 1          # Importance score (1-10)
        self.embedding: list = []         # Semantic embedding vector
        self.type: str = "observation"    # "observation" | "reflection" | "plan"
        self.related_ids: list = []       # IDs of related memories (used by reflections)
```
```mermaid
graph TD
    subgraph MS["Memory Stream"]
        M1["Observation: Saw John at the library<br/>t=8:00, imp=2"]
        M2["Observation: Discussed project with Maria<br/>t=9:30, imp=5"]
        M3["Observation: Heard about the party<br/>t=10:15, imp=6"]
        M4["Reflection: John has been at the library often lately<br/>t=10:30, imp=7"]
        M5["Observation: Received meeting invitation<br/>t=11:00, imp=4"]
        M6["Reflection: Should invite Maria to the party<br/>t=11:30, imp=8"]
    end
    M1 --> M4
    M3 --> M4
    M3 --> M6
    M2 --> M6
```
## Tri-Factor Retrieval Scoring

When the agent needs to retrieve relevant memories from the memory stream, it scores each candidate with a weighted combination of three factors:

\[
\text{score}(m, q) = \alpha \cdot \text{recency}(m) + \beta \cdot \text{importance}(m) + \gamma \cdot \text{relevance}(m, q)
\]

where:

- \(m\) is the memory object
- \(q\) is the current query (current context)
- \(\alpha, \beta, \gamma\) are tunable weight hyperparameters
### Factor 1: Recency

Recency scoring uses an exponential decay function, so more recently accessed memories score higher:

\[
\text{recency}(m) = e^{-\lambda \, \Delta t}
\]

where:

- \(\Delta t = t_{\text{now}} - t_{\text{last\_access}}(m)\) is the time since last access
- \(\lambda\) is the decay rate parameter
```python
import math

def recency_score(memory, current_time, decay_factor=0.995):
    """Compute recency score.

    Args:
        memory: Memory object
        current_time: Current time (hours)
        decay_factor: Per-hour decay factor; smaller values mean faster
            decay. Equivalent to exp(-lambda) with lambda = -ln(decay_factor).
    Returns:
        Recency score between 0 and 1
    """
    hours_passed = current_time - memory.last_access
    return decay_factor ** hours_passed
```
The choice of decay rate \(\lambda\) affects the memory's "shelf life". (The default decay factor of 0.995 in the code corresponds to \(\lambda = -\ln 0.995 \approx 0.005\), a half-life of roughly 138 hours.)
| \(\lambda\) | Half-life | Use Case |
|---|---|---|
| 0.01 | ~69 hours | Long-term social relationships |
| 0.05 | ~14 hours | Daily events |
| 0.1 | ~7 hours | Immediate conversation |
| 0.5 | ~1.4 hours | Short-term tasks |
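The half-lives in the table follow from \(t_{1/2} = \ln 2 / \lambda\); a quick numeric check against the exponential-decay form:

```python
import math

def recency(hours_passed, decay_rate):
    """Exponential-decay recency: exp(-lambda * dt)."""
    return math.exp(-decay_rate * hours_passed)

# At the half-life t_1/2 = ln(2) / lambda the score drops to exactly 0.5
for lam in (0.01, 0.05, 0.1, 0.5):
    half_life = math.log(2) / lam
    print(f"lambda={lam}: half-life ~{half_life:.1f} h, "
          f"score there = {recency(half_life, lam):.2f}")
```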
### Factor 2: Importance

The importance score is assigned once by the LLM, at memory creation time:
**Scoring prompt:**

```text
On the scale of 1 to 10, where 1 is purely mundane (e.g., brushing
teeth, making bed) and 10 is extremely poignant (e.g., a break up,
college acceptance), rate the likely poignancy of the following
piece of memory.
Memory: {memory_description}
Rating: <fill in>
```
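To turn the model's free-form reply into a usable score, it has to be parsed defensively. A minimal sketch (the prompt constant mirrors the text above; `parse_importance` and its fallback default are illustrative helpers, not the paper's code):

```python
import re

# Prompt template from the paper, with a slot for the memory text
IMPORTANCE_PROMPT = (
    "On the scale of 1 to 10, where 1 is purely mundane (e.g., brushing "
    "teeth, making bed) and 10 is extremely poignant (e.g., a break up, "
    "college acceptance), rate the likely poignancy of the following "
    "piece of memory.\n"
    "Memory: {memory_description}\n"
    "Rating: "
)

def parse_importance(llm_reply, default=3):
    """Extract the first integer in 1-10 from the model's reply.

    Falls back to a mid-low default when the reply is unparseable."""
    match = re.search(r"\b([1-9]|10)\b", llm_reply)
    return int(match.group(1)) if match else default

print(parse_importance("Rating: 8"))    # 8
print(parse_importance("I'd say 10."))  # 10
print(parse_importance("hard to say"))  # 3 (fallback)
```

The fallback matters in practice: a single unparseable reply should not crash memory creation, and a conservative default keeps such memories retrievable without inflating their weight.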
Importance scores are normalized to \([0, 1]\):

\[
\text{importance}(m) = \frac{I_m - 1}{9}, \qquad I_m \in \{1, \dots, 10\}
\]
**Importance score examples:**
- "Brushing teeth" -> 1 (completely mundane)
- "Seeing a colleague at a coffee shop" -> 3 (mild social event)
- "Learning a close friend is getting married" -> 8 (significant social event)
- "Being told about losing a job" -> 10 (major life event)
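Applying the \((I - 1)/9\) normalization from above to these ratings (a minimal helper for illustration):

```python
def normalize_importance(rating):
    """Map a raw 1-10 importance rating onto [0, 1]."""
    return (rating - 1) / 9.0

print(normalize_importance(1))   # 0.0  (brushing teeth: completely mundane)
print(normalize_importance(10))  # 1.0  (losing a job: major life event)
```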
### Factor 3: Relevance

Relevance is computed as the cosine similarity of semantic embeddings:

\[
\text{relevance}(m, q) = \frac{\mathbf{e}_q \cdot \mathbf{e}_m}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_m \rVert}
\]

where \(\mathbf{e}_q\) and \(\mathbf{e}_m\) are the embedding vectors of the query and memory, respectively.
```python
import numpy as np

def relevance_score(query_embedding, memory_embedding):
    """Compute relevance score (cosine similarity)"""
    dot_product = np.dot(query_embedding, memory_embedding)
    norm_product = np.linalg.norm(query_embedding) * np.linalg.norm(memory_embedding)
    if norm_product == 0:
        return 0.0
    return dot_product / norm_product
```
## Complete Retrieval Process

```python
def retrieve_memories(memory_stream, query, current_time, top_k=10,
                      alpha=1.0, beta=1.0, gamma=1.0, decay_factor=0.995):
    """Retrieve the most relevant memories from the memory stream.

    Tri-factor weighted retrieval: recency + importance + relevance.
    Assumes get_embedding() wraps an embedding model.
    """
    query_embedding = get_embedding(query)
    scored_memories = []
    for memory in memory_stream:
        rec = recency_score(memory, current_time, decay_factor)
        imp = (memory.importance - 1) / 9.0  # Normalize 1-10 rating to [0, 1]
        rel = relevance_score(query_embedding, memory.embedding)
        total = alpha * rec + beta * imp + gamma * rel
        scored_memories.append((memory, total))
    # Sort by total score and keep the top_k
    scored_memories.sort(key=lambda x: x[1], reverse=True)
    # Retrieval counts as access: refresh last_access for returned memories
    for memory, _ in scored_memories[:top_k]:
        memory.last_access = current_time
    return [m for m, _ in scored_memories[:top_k]]
```
## Reflection Mechanism

Reflection is one of the most important innovations in the memory stream architecture: agents not only store raw observations but also synthesize higher-level, abstract insights from groups of related memories.
### Reflection Trigger Condition

Reflection is triggered when the summed importance of memories created since the last reflection exceeds a threshold:

\[
\sum_{m \,:\, t_{\text{create}}(m) > t_{\text{last\_reflect}}} I_m \geq \theta_{\text{reflect}}
\]

Typically \(\theta_{\text{reflect}} = 150\), i.e. reflection triggers once the cumulative importance reaches 150.
### Reflection Generation Process

```mermaid
graph TD
    A[Trigger Reflection] --> B[Determine Reflection Topics]
    B --> C["Retrieve Related Memories<br/>top 100"]
    C --> D[LLM Generates High-Level Insights]
    D --> E[Create Reflection Memory Object]
    E --> F[Add to Memory Stream]
    F --> G[Reset Importance Accumulator]
    B -->|Prompt| B1["Given only the information above,<br/>what are 3 most salient high-level<br/>questions we can answer?"]
    D -->|Prompt| D1["What 5 high-level insights can you<br/>infer from the above statements?"]
```
### Reflection Hierarchy
Reflection can be performed recursively, forming multiple levels of abstraction:
- Level 0 (Observation): "Klaus is painting oil paintings in the studio"
- Level 1 (Reflection): "Klaus is passionate about art"
- Level 2 (Meta-reflection): "Klaus is a person whose core identity revolves around creativity"
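Because each reflection records the IDs of the memories it was derived from (`related_ids`), this hierarchy can be traversed back down to the raw observations. A self-contained sketch, using a toy `Mem` dataclass as a stand-in for the memory object:

```python
from dataclasses import dataclass, field

@dataclass
class Mem:
    id: int
    type: str
    description: str
    related_ids: list = field(default_factory=list)

def evidence_chain(memory, memory_by_id, depth=0):
    """Yield (depth, memory) pairs from a reflection down to its evidence."""
    yield depth, memory
    for mid in memory.related_ids:
        yield from evidence_chain(memory_by_id[mid], memory_by_id, depth + 1)

# Level-0 observations feed a level-1 reflection, which feeds a level-2 one
mems = [
    Mem(1, "observation", "Klaus is painting oil paintings in the studio"),
    Mem(2, "observation", "Klaus signed up for a sculpture class"),
    Mem(3, "reflection", "Klaus is passionate about art", [1, 2]),
    Mem(4, "reflection", "Klaus's core identity revolves around creativity", [3]),
]
by_id = {m.id: m for m in mems}

for depth, m in evidence_chain(by_id[4], by_id):
    print("  " * depth + f"[{m.type}] {m.description}")
```

Walking the chain this way is also how an implementation can cite the evidence behind an insight when the agent is asked to justify its behavior.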
The trigger check and generation loop (helpers such as `get_recent_memories`, `generate_reflection_topics`, `generate_insights`, and `rate_importance` wrap retrieval or LLM calls):

```python
def maybe_reflect(agent, threshold=150):
    """Check whether reflection should be triggered, and run it if so"""
    recent_importance_sum = sum(
        m.importance for m in agent.memory_stream
        if m.creation_time > agent.last_reflection_time
    )
    if recent_importance_sum >= threshold:
        # Generate reflection topics from the most recent memories
        recent_memories = get_recent_memories(agent, n=100)
        topics = generate_reflection_topics(recent_memories)
        for topic in topics:
            # Retrieve memories related to the topic
            relevant = retrieve_memories(agent.memory_stream, topic,
                                         current_time=agent.current_time)
            # Generate reflective insights
            insights = generate_insights(relevant, topic)
            for insight in insights:
                # Create reflection memory object
                reflection = MemoryObject()
                reflection.description = insight
                reflection.type = "reflection"
                reflection.importance = rate_importance(insight)
                reflection.creation_time = agent.current_time
                reflection.last_access = agent.current_time
                reflection.embedding = get_embedding(insight)  # Needed for later retrieval
                reflection.related_ids = [m.id for m in relevant]
                agent.memory_stream.append(reflection)
        # Reset the accumulator by advancing the reflection timestamp
        agent.last_reflection_time = agent.current_time
```
## Ablation Study Results
Park et al. conducted systematic ablation studies on the tri-factor retrieval and reflection mechanisms:
### Retrieval Factor Ablation
| Configuration | Behavior Plausibility Score | Notes |
|---|---|---|
| Full model (tri-factor + reflection) | 8.4 / 10 | Baseline |
| Remove recency | 7.2 / 10 | Agent repeatedly mentions outdated information |
| Remove importance | 7.5 / 10 | Cannot distinguish trivia from significant events |
| Remove relevance | 6.8 / 10 | Retrieves irrelevant memories |
| Recency only | 5.9 / 10 | Only remembers recent things |
| Relevance only | 6.3 / 10 | No temporal awareness |
### Reflection Mechanism Ablation
| Configuration | Behavior Plausibility Score | Key Observation |
|---|---|---|
| With reflection | 8.4 / 10 | Agent demonstrates deep understanding |
| Without reflection | 6.1 / 10 | Behavior remains at surface-level reactions |
| Without reflection + without planning | 4.8 / 10 | Behavior is nearly random |
### Key Finding
Removing the reflection mechanism caused the largest performance drop (-2.3 points), indicating that higher-level abstraction ability is crucial for credible agent behavior. Removing the relevance factor had the largest impact among the three retrieval factors (-1.6 points).
## Engineering Optimizations for Memory Streams

### Vector Indexing

When the memory stream grows to thousands of memories, brute-force scoring of every entry on each retrieval becomes inefficient:
```python
# Using FAISS for fast vector search (IndexFlatIP is exact; swap in an
# IVF or HNSW index for approximate search at larger scales)
import faiss
import numpy as np

class OptimizedMemoryStream:
    def __init__(self, embedding_dim=1536):
        self.index = faiss.IndexFlatIP(embedding_dim)  # Inner-product index
        self.memories = []

    def add_memory(self, memory):
        self.memories.append(memory)
        embedding = np.array([memory.embedding], dtype='float32')
        faiss.normalize_L2(embedding)  # After normalization, inner product = cosine similarity
        self.index.add(embedding)

    def search_relevant(self, query_embedding, top_k=50):
        query = np.array([query_embedding], dtype='float32')
        faiss.normalize_L2(query)
        scores, indices = self.index.search(query, top_k)
        return [(self.memories[i], scores[0][j])
                for j, i in enumerate(indices[0])
                if i != -1]  # FAISS pads with -1 when top_k exceeds index size
```
### Memory Compression
Long-running agents need memory compression strategies:
- Forgetting: Low importance + low access frequency memories are deleted
- Merging: Similar low-level memories are merged into summaries
- Tiering: Old memories move to "long-term storage" with reduced retrieval frequency
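A minimal sketch of the forgetting strategy, assuming a retention score that blends normalized importance with staleness; the 0.7/0.3 weights and the staleness decay rate are illustrative choices, not from the paper:

```python
import math
from types import SimpleNamespace

def prune_memories(memories, current_time, keep_fraction=0.8):
    """Forget the least valuable memories.

    Retention blends normalized importance with a recency term, so that
    important-but-old and trivial-but-fresh memories can both survive.
    """
    def retention(m):
        staleness = current_time - m.last_access  # Hours since last access
        return 0.7 * (m.importance - 1) / 9.0 + 0.3 * math.exp(-0.05 * staleness)

    ranked = sorted(memories, key=retention, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

# Toy memories: (importance 1-10, last_access in hours)
memories = [SimpleNamespace(importance=i, last_access=t)
            for i, t in [(9, 0.0), (2, 95.0), (2, 99.5), (7, 50.0)]]
survivors = prune_memories(memories, current_time=100.0, keep_fraction=0.5)
```

In this toy run the two high-importance memories outlive the recent-but-trivial ones, which is the intended asymmetry: forgetting should be driven by value, not by age alone.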
Importance Score Caching
# Batch evaluate importance to reduce LLM calls
def batch_rate_importance(descriptions, batch_size=10):
"""Batch evaluate memory importance to reduce API calls"""
results = []
for i in range(0, len(descriptions), batch_size):
batch = descriptions[i:i+batch_size]
prompt = format_batch_importance_prompt(batch)
ratings = call_llm(prompt)
results.extend(parse_ratings(ratings))
return results
## Comparison with Other Memory Systems
| Feature | Memory Stream (Park) | RAG | MemGPT | Traditional DB |
|---|---|---|---|---|
| Temporal awareness | Exponential decay | None | Limited | Queryable |
| Importance filtering | LLM scoring | None | Tiered management | Manual tagging |
| Semantic retrieval | Embedding similarity | Embedding similarity | Embedding similarity | SQL queries |
| Abstraction ability | Reflection mechanism | None | Limited | None |
| Scalability | Medium | High | Medium | High |
## Summary
Core contributions of the memory stream and reflection mechanisms:
- Tri-factor retrieval provides a more human-like memory access pattern than pure semantic retrieval
- Reflection mechanism enables agents to extract high-level insights from experience rather than remaining at the surface
- Ablation studies validate the necessity of each component, especially the critical role of reflection
- This architecture established the memory system design paradigm for subsequent virtual embodied agent research