Generative Agents Introduction
If we take AGI as the ultimate goal, two distinct schools of thought emerge regarding the form that will ultimately embody this intellectual capability:
- Embodied AI — agents, whether virtual or physical, whose ultimate aim is to achieve human-level capabilities.
- Generative Agents — akin to the virtual humans in The Matrix or the intelligent NPCs in Westworld.
I take the term "Generative Agents" from Joon Sung Park's 2023 research on a simulated town of intelligent NPCs — Generative Agents: Interactive Simulacra of Human Behavior — to distinguish these agents from traditional NPCs whose behavioral patterns are manually scripted.
This survey article takes Joon Sung Park's doctoral dissertation, defended this past August, as its point of departure to trace the past and present of Generative Agents. Joon launched a new course at Stanford in Fall 2024 called CS 222: AI Agents and Simulations. In the course materials, Joon candidly stated his vision:
A world simulator of 8 billion people.
This is not Joon's vision alone — it is a shared aspiration among virtual-world enthusiasts, game developers, and film and television creators. The Matrix, Digimon... one film after another about virtual worlds has showcased humanity's beautiful fantasies in this direction. I believe virtual worlds embody an idealistic aspiration: because people recognize that reality is imperfect and falls short of the ideal, they yearn to build "an ideal world" within a virtual one.
Related Background
A Brief History of Virtual Worlds
Humanity's exploration of virtual worlds has a long history. From early text-based MUDs to modern 3D open worlds, this field has evolved over several decades:
- The Sims (2000): Developed by Maxis, this life simulation game was a milestone in virtual character behavior simulation. Players set needs and personality traits for virtual characters, who then act autonomously based on predefined rules. However, NPC behavior in The Sims is fundamentally rule-based, driven by finite state machines and need hierarchies, lacking genuine "understanding" or "creativity"
- Second Life (2003): Developed by Linden Lab, this online virtual world allowed users to create avatars and engage in social interaction, commerce, and construction. It demonstrated the social potential of virtual worlds, but its NPCs remained script-driven
- The Metaverse Wave (2021-2022): Meta's (Facebook's) push for the metaverse brought virtual worlds back into the spotlight. However, the technology at the time was still insufficient to support truly intelligent virtual inhabitants
Traditional NPC Behavior Modeling
Prior to Generative Agents, NPC behavior in games and simulations relied primarily on the following methods:
- Finite State Machines (FSM): NPCs switch between predefined states (e.g., patrol, chase, attack), with fixed and predictable behavior patterns
- Behavior Trees: A more flexible decision structure, but one that still requires manually designing all possible behavioral branches
- Utility Systems: Score each possible action based on utility functions and select the highest-scoring action
- GOAP (Goal-Oriented Action Planning): A goal-based planning system that can automatically search for action sequences to achieve objectives
The common limitation of these methods is that NPC behavioral space is constrained by rules and patterns predetermined by developers, unable to produce truly emergent behavior. The breakthrough contribution of Generative Agents lies in replacing these hand-crafted rules with large language models, enabling virtual characters to "think" and "decide" based on their experiences.
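To make the contrast concrete, a classic FSM NPC can be written down exhaustively in a few lines. This is a generic illustration, not code from any particular game; the states and events are invented for the sketch:

```python
from enum import Enum, auto

class State(Enum):
    PATROL = auto()
    CHASE = auto()
    ATTACK = auto()

# Hand-authored transition table: (state, event) -> next state. Any
# situation the designer did not anticipate simply has no entry, which
# is exactly the rigidity that Generative Agents aim to escape.
TRANSITIONS = {
    (State.PATROL, "player_spotted"):  State.CHASE,
    (State.CHASE,  "player_in_range"): State.ATTACK,
    (State.CHASE,  "player_lost"):     State.PATROL,
    (State.ATTACK, "player_fled"):     State.CHASE,
}

def step(state: State, event: str) -> State:
    # Unknown (state, event) pairs leave the NPC's state unchanged.
    return TRANSITIONS.get((state, event), state)
```

Every possible behavior of this NPC is enumerable in advance by reading the table, which is what makes FSM behavior "fixed and predictable."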
Cognitive Architecture
Joon Sung Park et al. proposed a complete cognitive architecture in their paper Generative Agents: Interactive Simulacra of Human Behavior, consisting of three core modules:
Memory Stream
The Memory Stream is the foundation of the entire architecture. It records all of an agent's experiences as natural language entries. Each memory entry contains the following attributes:
- Description: A natural language description of the event, e.g., "Isabella Rodriguez is decorating the coffee shop for a Valentine's Day party"
- Creation Timestamp: The time at which the memory was created
- Last Access Timestamp: The time at which the memory was last retrieved
The design philosophy behind the Memory Stream is that all of an agent's perceptions — observed events, conversations with others, and its own actions — are stored in a unified format, forming a continuously growing experience database.
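The entry structure above can be sketched as a small Python data class. The class and field names here are my own shorthand, not identifiers from the paper's codebase, and the `importance` field anticipates the significance score used by retrieval:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MemoryEntry:
    # Natural language description of the event, e.g. "Isabella Rodriguez
    # is decorating the coffee shop for a Valentine's Day party"
    description: str
    created_at: datetime        # creation timestamp
    last_accessed_at: datetime  # updated whenever the memory is retrieved
    # 1-10 significance judged by the LLM; consumed by retrieval scoring
    importance: float = 1.0

class MemoryStream:
    """Append-only record of everything the agent perceives or does."""

    def __init__(self):
        self.entries: list[MemoryEntry] = []

    def add(self, description: str, importance: float, now: datetime):
        self.entries.append(MemoryEntry(description, now, now, importance))
```

Observations, conversations, and the agent's own actions all pass through the same `add` path, which is what "unified format" means in practice.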
Retrieval Mechanism
When an agent needs to make a decision, it cannot possibly review all historical memories — this is neither feasible nor necessary. The retrieval mechanism is responsible for extracting the most relevant memories from the Memory Stream. Retrieval scoring is based on three dimensions:
- Recency: More recent memories receive higher scores, using an exponential decay function
- Importance: Memories are scored on a 1-10 scale for significance (judged by the LLM). Mundane activities (e.g., eating breakfast) score low, while major events (e.g., a breakup, getting a new job) score high
- Relevance: The semantic similarity between the memory content and the current context, computed via cosine similarity of embedding vectors
Final retrieval score = \(\alpha_{\text{recency}} \cdot \text{recency} + \alpha_{\text{importance}} \cdot \text{importance} + \alpha_{\text{relevance}} \cdot \text{relevance}\)
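A minimal sketch of this scoring, assuming each component has already been min-max normalized to [0, 1]. The paper sets all three alpha weights to 1 and uses a per-hour decay factor of 0.995; treat both as tunable parameters here:

```python
import math
from datetime import datetime

def recency_score(last_accessed: datetime, now: datetime,
                  decay: float = 0.995) -> float:
    # Exponential decay per hour since the memory was last accessed.
    hours = (now - last_accessed).total_seconds() / 3600
    return decay ** hours

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Relevance: similarity between the memory's embedding and the
    # embedding of the current query context.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieval_score(recency: float, importance: float, relevance: float,
                    a_rec: float = 1.0, a_imp: float = 1.0,
                    a_rel: float = 1.0) -> float:
    # Inputs are assumed already normalized to [0, 1].
    return a_rec * recency + a_imp * importance + a_rel * relevance
```

In use, every candidate memory is scored this way against the current context and the top-scoring entries are placed into the LLM prompt.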
Reflection
If an agent merely stored and retrieved raw observations, it would lack higher-level understanding of its experiences. The Reflection module periodically abstracts and synthesizes accumulated memories to generate higher-order insights:
- Trigger condition: When the agent's cumulative importance score exceeds a threshold, the reflection process is triggered
- Generating reflection questions: Based on recent memories, the LLM generates several thought-provoking questions (e.g., "What has Isabella Rodriguez been primarily focused on recently?")
- Producing reflection conclusions: Relevant memories are retrieved from the Memory Stream, and higher-level inferences are generated based on this evidence (e.g., "Isabella Rodriguez is passionate about community building")
- Storing back to Memory Stream: The reflection conclusions themselves are stored as new memory entries in the Memory Stream, with relatively high importance scores
The reflection mechanism enables agents to abstract general beliefs and attitudes from specific facts — precisely the cognitive process by which humans distill knowledge from experience.
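The four steps above can be condensed into a sketch. The trigger threshold of 150 matches the value the paper reports; the 100-memory window, the `llm` callable, and the importance of 8 assigned to the insight are illustrative assumptions:

```python
def maybe_reflect(memories, llm, threshold=150, window=100):
    """Run one reflection pass if enough has happened recently.

    memories: list of (description, importance) pairs, newest last.
    llm: any callable mapping a prompt string to a response string.
    """
    recent = memories[-window:]
    if sum(imp for _, imp in recent) < threshold:
        return None  # not enough accumulated importance; skip reflection
    context = "\n".join(desc for desc, _ in recent)
    # 1. Generate salient high-level questions about recent experience.
    questions = llm("Given only the information below, what are the 3 most "
                    "salient high-level questions we can answer about the "
                    "agent?\n" + context)
    # 2. Answer them with a higher-level inference over the evidence
    #    (per-question memory retrieval is omitted in this sketch).
    insight = llm("What high-level insight can you infer from the "
                  "following?\n" + context + "\n" + questions)
    # 3. Return the insight as a new high-importance memory entry to be
    #    stored back into the Memory Stream.
    return (insight, 8.0)
```

Because the insight re-enters the Memory Stream, later retrievals can surface the abstraction instead of having to re-derive it from raw observations.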
Simulations
The Smallville Experiment
Joon Sung Park's team constructed a sandbox environment called Smallville to validate the capabilities of Generative Agents. Smallville is a small town reminiscent of The Sims, containing locations such as a coffee shop, park, residences, and a school.
Experimental setup:
- 25 Generative Agents: Each agent has an independent background identity (e.g., "Isabella Rodriguez is the owner of Hobbs Cafe and is passionate about making the community a better place")
- Simulation period: Two virtual days of continuous life
- Interaction mode: Agents autonomously plan schedules, move through the town, converse with others, and form relationships
Emergent Behavior Examples
The experiment revealed several remarkable emergent behaviors — behaviors that were not pre-programmed but arose naturally from agents' memory and reasoning processes:
- Information diffusion: When one agent told another about a Valentine's Day party, the news gradually spread throughout the community through inter-agent conversations
- Social relationship evolution: Two agents who were previously strangers gradually developed a friendship after multiple chance encounters and conversations
- Coordinated behavior: Multiple agents spontaneously organized the preparation for a Valentine's Day party without any central coordination — some decorated the venue while others invited friends
- Schedule adjustment: Agents adjusted their daily schedules in response to newly acquired information (e.g., upon learning about the party time, modifying their plans to attend)
Evaluation Methods
The paper employed multiple evaluation approaches to verify the behavioral credibility of Generative Agents:
- Human evaluation: Human evaluators judged whether agents' behaviors seemed "reasonably human-like"
- Ablation study: Individually removing modules such as memory retrieval, reflection, and planning to observe the degree of behavioral quality degradation. The experiments showed that removing any single module significantly reduced behavioral credibility
- Emergent behavior analysis: Qualitative analysis of coordinated behaviors and social dynamics that arose within the agent population
Construction
Technology Stack
Building a Generative Agents system requires the following core technical components:
- Large Language Model (LLM): Serves as the agent's "brain," responsible for generating action plans, dialogue content, and reflections. The original paper used ChatGPT (GPT-3.5-turbo); current implementations can leverage more powerful models such as GPT-4 or Claude
- Memory system:
  - Use vector databases (e.g., ChromaDB, FAISS) to store embedding representations of memories
  - Implement the three-dimensional retrieval algorithm based on recency, importance, and relevance
  - Support generation and storage of reflective memories
- Environment interface:
  - Define the virtual world's map, locations, and objects
  - Implement agent movement, interaction, and perception logic
  - Manage time progression and event triggering
- Agent scheduler: Coordinate concurrent actions and interactions among multiple agents
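To show how these components fit together, here is a minimal round-robin tick loop. `Agent`, `run_simulation`, and the `world.observe`/`world.apply` interface are hypothetical names for this sketch, not the paper's actual architecture:

```python
class Agent:
    """Minimal generative-agent shell: perceive, decide, remember."""

    def __init__(self, name, decide):
        self.name = name
        self.decide = decide   # callable: observations -> action string
        self.memories = []     # stands in for the Memory Stream

    def tick(self, observations):
        self.memories.extend(observations)               # perceive
        action = self.decide(observations)               # plan (LLM call)
        self.memories.append(f"{self.name}: {action}")   # remember own action
        return action

def run_simulation(agents, world, steps):
    # Round-robin scheduler: each virtual time step, every agent observes
    # its surroundings and chooses an action; the world then applies all
    # actions at once, advancing time and triggering events.
    for _ in range(steps):
        actions = {agent.name: agent.tick(world.observe(agent.name))
                   for agent in agents}
        world.apply(actions)
```

In a real system, `decide` would wrap retrieval plus an LLM call, and the scheduler would need to handle concurrent conversations between agents rather than strictly sequential turns.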
Construction Challenges
Building a Generative Agents system in practice faces several major challenges:
- Cost: Every decision by every agent requires an LLM call. Running a two-day simulation with 25 agents consumed thousands of API calls at considerable expense
- Latency: LLM inference latency makes real-time simulation difficult, requiring trade-offs between response speed and behavioral quality
- Memory management: As simulation time grows, the Memory Stream expands continuously, making retrieval efficiency and relevance maintenance increasingly challenging
- Behavioral consistency: Ensuring that an agent's long-term behavior remains consistent with its configured personality and background, avoiding "character collapse"
- Scalability: Scaling from 25 agents to Joon's envisioned "simulator of 8 billion" presents enormous architectural and computational challenges
- Evaluation difficulty: There is a lack of standardized metrics for measuring the "authenticity" and "plausibility" of virtual agent behavior
Value and Significance
Joon's discussion of the value and significance of Generative Agents is primarily from a societal perspective.