Generative Agents Introduction
If we take AGI as the ultimate goal, two distinct schools of thought emerge regarding the form that will ultimately embody this intellectual capability:
- Embodied AI — agents, whether virtual or physical, whose ultimate aim is to achieve human-level capabilities.
- Generative Agents — akin to the virtual humans in The Matrix or the intelligent NPCs in Westworld.
I take the term "Generative Agents" from Joon Sung Park's 2023 research on a simulated town of intelligent NPCs — Generative Agents: Interactive Simulacra of Human Behavior — to distinguish these agents from traditional NPCs whose behavioral patterns are manually scripted.
This survey article takes Joon Sung Park's doctoral dissertation, defended this past August, as its point of departure to trace the past and present of Generative Agents. Joon launched a new course at Stanford in Fall 2024 called CS 222: AI Agents and Simulations. In the course materials, Joon candidly stated his vision:
A world simulator of 8 billion people.
This is not Joon's vision alone — it is a shared aspiration among virtual-world enthusiasts, game developers, and film and television creators. The Matrix, Digimon... one film after another about virtual worlds has showcased humanity's beautiful fantasies in this direction. I believe virtual worlds embody an idealistic aspiration: because people recognize that reality is imperfect and falls short of the ideal, they yearn to build "an ideal world" within a virtual one.
Related Background
A Brief History of Virtual Worlds
Humanity's exploration of virtual worlds has a long history. From early text-based MUDs to modern 3D open worlds, this field has evolved over several decades:
- The Sims (2000): Developed by Maxis, this life simulation game was a milestone in virtual character behavior simulation. Players set needs and personality traits for virtual characters, who then act autonomously based on predefined rules. However, NPC behavior in The Sims is fundamentally rule-based, driven by finite state machines and need hierarchies, lacking genuine "understanding" or "creativity"
- Second Life (2003): Developed by Linden Lab, this online virtual world allowed users to create avatars and engage in social interaction, commerce, and construction. It demonstrated the social potential of virtual worlds, but its NPCs remained script-driven
- The Metaverse Wave (2021-2022): Meta's (Facebook's) push for the metaverse brought virtual worlds back into the spotlight. However, the technology at the time was still insufficient to support truly intelligent virtual inhabitants
Traditional NPC Behavior Modeling
Prior to Generative Agents, NPC behavior in games and simulations relied primarily on the following methods:
- Finite State Machines (FSM): NPCs switch between predefined states (e.g., patrol, chase, attack), with fixed and predictable behavior patterns
- Behavior Trees: A more flexible decision structure, but one that still requires manually designing all possible behavioral branches
- Utility Systems: Score each possible action based on utility functions and select the highest-scoring action
- GOAP (Goal-Oriented Action Planning): A goal-based planning system that can automatically search for action sequences to achieve objectives
The common limitation of these methods is that NPC behavioral space is constrained by rules and patterns predetermined by developers, unable to produce truly emergent behavior. The breakthrough contribution of Generative Agents lies in replacing these hand-crafted rules with large language models, enabling virtual characters to "think" and "decide" based on their experiences.
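To make the contrast concrete, a classic FSM NPC can be written down exhaustively in a few lines. This is a generic illustration, not code from any particular game; the states and events are invented for the sketch:

```python
from enum import Enum, auto

class State(Enum):
    PATROL = auto()
    CHASE = auto()
    ATTACK = auto()

# Hand-authored transition table: (state, event) -> next state. Any
# situation the designer did not anticipate simply has no entry, which
# is exactly the rigidity that Generative Agents aim to escape.
TRANSITIONS = {
    (State.PATROL, "player_spotted"):  State.CHASE,
    (State.CHASE,  "player_in_range"): State.ATTACK,
    (State.CHASE,  "player_lost"):     State.PATROL,
    (State.ATTACK, "player_fled"):     State.CHASE,
}

def step(state: State, event: str) -> State:
    # Unknown (state, event) pairs leave the NPC's state unchanged.
    return TRANSITIONS.get((state, event), state)
```

Every possible behavior of this NPC is enumerable in advance by reading the table, which is what makes FSM behavior "fixed and predictable."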
Cognitive Architecture
Joon Sung Park et al. proposed a complete cognitive architecture in their paper Generative Agents: Interactive Simulacra of Human Behavior, consisting of three core modules:
Memory Stream
The Memory Stream is the foundation of the entire architecture. It records all of an agent's experiences as natural language entries. Each memory entry contains the following attributes:
- Description: A natural language description of the event, e.g., "Isabella Rodriguez is decorating the coffee shop for a Valentine's Day party"
- Creation Timestamp: The time at which the memory was created
- Last Access Timestamp: The time at which the memory was last retrieved
The design philosophy behind the Memory Stream is that all of an agent's perceptions — observed events, conversations with others, and its own actions — are stored in a unified format, forming a continuously growing experience database.
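The entry structure above can be sketched as a small Python data class. The class and field names here are my own shorthand, not identifiers from the paper's codebase, and the `importance` field anticipates the significance score used by retrieval:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MemoryEntry:
    # Natural language description of the event, e.g. "Isabella Rodriguez
    # is decorating the coffee shop for a Valentine's Day party"
    description: str
    created_at: datetime        # creation timestamp
    last_accessed_at: datetime  # updated whenever the memory is retrieved
    # 1-10 significance judged by the LLM; consumed by retrieval scoring
    importance: float = 1.0

class MemoryStream:
    """Append-only record of everything the agent perceives or does."""

    def __init__(self):
        self.entries: list[MemoryEntry] = []

    def add(self, description: str, importance: float, now: datetime):
        self.entries.append(MemoryEntry(description, now, now, importance))
```

Observations, conversations, and the agent's own actions all pass through the same `add` path, which is what "unified format" means in practice.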
Retrieval Mechanism
When an agent needs to make a decision, it cannot possibly review all historical memories — this is neither feasible nor necessary. The retrieval mechanism is responsible for extracting the most relevant memories from the Memory Stream. Retrieval scoring is based on three dimensions:
- Recency: More recent memories receive higher scores, using an exponential decay function
- Importance: Memories are scored on a 1-10 scale for significance (judged by the LLM). Mundane activities (e.g., eating breakfast) score low, while major events (e.g., a breakup, getting a new job) score high
- Relevance: The semantic similarity between the memory content and the current context, computed via cosine similarity of embedding vectors
Final retrieval score = \(\alpha_{\text{recency}} \cdot \text{recency} + \alpha_{\text{importance}} \cdot \text{importance} + \alpha_{\text{relevance}} \cdot \text{relevance}\)
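A minimal sketch of this scoring, assuming each component has already been min-max normalized to [0, 1]. The paper sets all three alpha weights to 1 and uses a per-hour decay factor of 0.995; treat both as tunable parameters here:

```python
import math
from datetime import datetime

def recency_score(last_accessed: datetime, now: datetime,
                  decay: float = 0.995) -> float:
    # Exponential decay per hour since the memory was last accessed.
    hours = (now - last_accessed).total_seconds() / 3600
    return decay ** hours

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Relevance: similarity between the memory's embedding and the
    # embedding of the current query context.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieval_score(recency: float, importance: float, relevance: float,
                    a_rec: float = 1.0, a_imp: float = 1.0,
                    a_rel: float = 1.0) -> float:
    # Inputs are assumed already normalized to [0, 1].
    return a_rec * recency + a_imp * importance + a_rel * relevance
```

In use, every candidate memory is scored this way against the current context and the top-scoring entries are placed into the LLM prompt.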
Reflection
If an agent merely stored and retrieved raw observations, it would lack higher-level understanding of its experiences. The Reflection module periodically abstracts and synthesizes accumulated memories to generate higher-order insights:
- Trigger condition: When the agent's cumulative importance score exceeds a threshold, the reflection process is triggered
- Generating reflection questions: Based on recent memories, the LLM generates several thought-provoking questions (e.g., "What has Isabella Rodriguez been primarily focused on recently?")
- Producing reflection conclusions: Relevant memories are retrieved from the Memory Stream, and higher-level inferences are generated based on this evidence (e.g., "Isabella Rodriguez is passionate about community building")
- Storing back to Memory Stream: The reflection conclusions themselves are stored as new memory entries in the Memory Stream, with relatively high importance scores
The reflection mechanism enables agents to abstract general beliefs and attitudes from specific facts — precisely the cognitive process by which humans distill knowledge from experience.
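The four steps above can be condensed into a sketch. The trigger threshold of 150 matches the value the paper reports; the 100-memory window, the `llm` callable, and the importance of 8 assigned to the insight are illustrative assumptions:

```python
def maybe_reflect(memories, llm, threshold=150, window=100):
    """Run one reflection pass if enough has happened recently.

    memories: list of (description, importance) pairs, newest last.
    llm: any callable mapping a prompt string to a response string.
    """
    recent = memories[-window:]
    if sum(imp for _, imp in recent) < threshold:
        return None  # not enough accumulated importance; skip reflection
    context = "\n".join(desc for desc, _ in recent)
    # 1. Generate salient high-level questions about recent experience.
    questions = llm("Given only the information below, what are the 3 most "
                    "salient high-level questions we can answer about the "
                    "agent?\n" + context)
    # 2. Answer them with a higher-level inference over the evidence
    #    (per-question memory retrieval is omitted in this sketch).
    insight = llm("What high-level insight can you infer from the "
                  "following?\n" + context + "\n" + questions)
    # 3. Return the insight as a new high-importance memory entry to be
    #    stored back into the Memory Stream.
    return (insight, 8.0)
```

Because the insight re-enters the Memory Stream, later retrievals can surface the abstraction instead of having to re-derive it from raw observations.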
Simulations
The Smallville Experiment
Joon Sung Park's team constructed a sandbox environment called Smallville to validate the capabilities of Generative Agents. Smallville is a small town reminiscent of The Sims, containing locations such as a coffee shop, park, residences, and a school.
Experimental setup:
- 25 Generative Agents: Each agent has an independent background identity (e.g., "Isabella Rodriguez is the owner of Hobbs Cafe and is passionate about making the community a better place")
- Simulation period: Two virtual days of continuous life
- Interaction mode: Agents autonomously plan schedules, move through the town, converse with others, and form relationships
Emergent Behavior Examples
The experiment revealed several remarkable emergent behaviors — behaviors that were not pre-programmed but arose naturally from agents' memory and reasoning processes:
- Information diffusion: When one agent told another about a Valentine's Day party, the news gradually spread throughout the community through inter-agent conversations
- Social relationship evolution: Two agents who were previously strangers gradually developed a friendship after multiple chance encounters and conversations
- Coordinated behavior: Multiple agents spontaneously organized the preparation for a Valentine's Day party without any central coordination — some decorated the venue while others invited friends
- Schedule adjustment: Agents adjusted their daily schedules in response to newly acquired information (e.g., upon learning about the party time, modifying their plans to attend)
Evaluation Methods
The paper employed multiple evaluation approaches to verify the behavioral credibility of Generative Agents:
- Human evaluation: Human evaluators judged whether agents' behaviors seemed "reasonably human-like"
- Ablation study: Individually removing modules such as memory retrieval, reflection, and planning to observe the degree of behavioral quality degradation. The experiments showed that removing any single module significantly reduced behavioral credibility
- Emergent behavior analysis: Qualitative analysis of coordinated behaviors and social dynamics that arose within the agent population
Construction
Technology Stack
Building a Generative Agents system requires the following core technical components:
- Large Language Model (LLM): Serves as the agent's "brain," responsible for generating action plans, dialogue content, and reflections. The original paper used ChatGPT (GPT-3.5-turbo); current implementations can leverage more powerful models such as GPT-4 or Claude
- Memory system:
  - Use vector databases (e.g., ChromaDB, FAISS) to store embedding representations of memories
  - Implement the three-dimensional retrieval algorithm based on recency, importance, and relevance
  - Support generation and storage of reflective memories
- Environment interface:
  - Define the virtual world's map, locations, and objects
  - Implement agent movement, interaction, and perception logic
  - Manage time progression and event triggering
- Agent scheduler: Coordinate concurrent actions and interactions among multiple agents
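To show how these components fit together, here is a minimal round-robin tick loop. `Agent`, `run_simulation`, and the `world.observe`/`world.apply` interface are hypothetical names for this sketch, not the paper's actual architecture:

```python
class Agent:
    """Minimal generative-agent shell: perceive, decide, remember."""

    def __init__(self, name, decide):
        self.name = name
        self.decide = decide   # callable: observations -> action string
        self.memories = []     # stands in for the Memory Stream

    def tick(self, observations):
        self.memories.extend(observations)               # perceive
        action = self.decide(observations)               # plan (LLM call)
        self.memories.append(f"{self.name}: {action}")   # remember own action
        return action

def run_simulation(agents, world, steps):
    # Round-robin scheduler: each virtual time step, every agent observes
    # its surroundings and chooses an action; the world then applies all
    # actions at once, advancing time and triggering events.
    for _ in range(steps):
        actions = {agent.name: agent.tick(world.observe(agent.name))
                   for agent in agents}
        world.apply(actions)
```

In a real system, `decide` would wrap retrieval plus an LLM call, and the scheduler would need to handle concurrent conversations between agents rather than strictly sequential turns.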
Construction Challenges
Building a Generative Agents system in practice faces several major challenges:
- Cost: Every decision by every agent requires an LLM call. Running a two-day simulation with 25 agents consumed thousands of API calls at considerable expense
- Latency: LLM inference latency makes real-time simulation difficult, requiring trade-offs between response speed and behavioral quality
- Memory management: As simulation time grows, the Memory Stream expands continuously, making retrieval efficiency and relevance maintenance increasingly challenging
- Behavioral consistency: Ensuring that an agent's long-term behavior remains consistent with its configured personality and background, avoiding "character collapse"
- Scalability: Scaling from 25 agents to Joon's envisioned "simulator of 8 billion" presents enormous architectural and computational challenges
- Evaluation difficulty: There is a lack of standardized metrics for measuring the "authenticity" and "plausibility" of virtual agent behavior
Value and Significance
Joon's discussion of the value and significance of Generative Agents is primarily from a societal perspective.