Virtual World Simulation Engines
Overview
Virtual world simulation engines provide the operating environment for virtual embodied agents. From simple 2D grid worlds to complex 3D physics simulations, different engines suit different research and application scenarios.
Smallville Architecture
Stanford's Generative Agents project used a 2D grid world called Smallville as its simulation environment.
World Structure
```mermaid
graph TD
    subgraph SV["Smallville 2D Tile World"]
        A[World Map<br/>Grid Map] --> B[Zones]
        B --> C1[Residential Area<br/>Lin House / Moreno House / ...]
        B --> C2[Commercial Area<br/>Pharmacy / Cafe / ...]
        B --> C3[Public Area<br/>Park / School / ...]
        C1 --> D1[Rooms: Bedroom / Kitchen / Living Room]
        D1 --> E1[Objects: Bed / Refrigerator / Sofa]
    end
    subgraph AG["Agent Loop"]
        F[Perceive] --> G[Retrieve]
        G --> H[Plan]
        H --> I[Reflect]
        I --> J[Act]
        J --> F
    end
    A -.-> F
    J -.-> A
```
Environment Tree Structure
Smallville's environment is organized as a tree structure:
```
Smallville
├── Lin Family House
│   ├── Bedroom
│   │   ├── Bed (sleeping, making bed)
│   │   ├── Desk (writing, reading)
│   │   └── Closet (getting dressed)
│   ├── Kitchen
│   │   ├── Stove (cooking)
│   │   ├── Refrigerator (getting food)
│   │   └── Table (eating)
│   └── Living Room
│       ├── Sofa (relaxing, chatting)
│       └── TV (watching)
├── Hobbs Cafe
│   ├── Counter (ordering)
│   ├── Tables (eating, socializing)
│   └── Kitchen (preparing food)
└── ...
```
Each object (leaf node) carries a set of affordances; agents can only perform actions supported by the object.
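The tree-plus-affordances idea can be captured in a few lines of Python. This is a minimal sketch, not Smallville's actual implementation; the class and method names are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class EnvNode:
    """A node in the environment tree: area, room, or object."""
    name: str
    affordances: set[str] = field(default_factory=set)  # leaf-only actions
    children: list["EnvNode"] = field(default_factory=list)

    def find(self, name):
        """Depth-first lookup of a node by name."""
        if self.name == name:
            return self
        for child in self.children:
            found = child.find(name)
            if found:
                return found
        return None

    def can_perform(self, name, action):
        """An agent may only act on an object that affords that action."""
        node = self.find(name)
        return node is not None and action in node.affordances

# Build a fragment of the tree from the listing above
world = EnvNode("Smallville", children=[
    EnvNode("Lin Family House", children=[
        EnvNode("Kitchen", children=[
            EnvNode("Stove", affordances={"cooking"}),
            EnvNode("Refrigerator", affordances={"getting food"}),
        ]),
    ]),
])

print(world.can_perform("Stove", "cooking"))   # True
print(world.can_perform("Stove", "sleeping"))  # False
```

Because affordances live on leaf nodes, adding a new object with new verbs never requires touching the agent logic, only the tree.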
Agent Loop
At each simulation timestep (typically 1 minute), each agent executes the following loop:
- Perceive: Observe the environment state and other agents within the field of view
- Retrieve: Fetch relevant memories from the memory stream
- Plan: Generate or update the action plan
- Reflect: Conditionally triggered higher-level thinking that synthesizes recent memories into insights
- Act: Execute the current action from the plan
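The loop above can be sketched as a skeleton class. Every method here is an illustrative stub (the real system backs retrieval with scored memory search and planning with LLM calls); none of these names come from the paper's codebase.

```python
class GenerativeAgent:
    """Skeleton of the perceive-retrieve-plan-reflect-act cycle."""

    def __init__(self, name):
        self.name = name
        self.memory = []   # memory stream, most recent last
        self.plan = []     # queue of pending actions

    def perceive(self, world):
        # Record observations within the field of view
        observations = world.get(self.name, [])
        self.memory.extend(observations)
        return observations

    def retrieve(self, observations, k=3):
        # Stand-in for recency/importance/relevance-scored retrieval
        return self.memory[-k:]

    def update_plan(self, relevant):
        # Stand-in for LLM-generated planning
        if not self.plan:
            self.plan = [f"react to: {m}" for m in relevant]

    def maybe_reflect(self, threshold=5):
        # Reflection fires only once enough memories have accumulated
        if len(self.memory) >= threshold:
            self.memory.append("[reflection] summary of recent events")

    def step(self, world):
        observations = self.perceive(world)
        relevant = self.retrieve(observations)
        self.update_plan(relevant)
        self.maybe_reflect()
        return self.plan.pop(0) if self.plan else "idle"

agent = GenerativeAgent("Klaus")
world_state = {"Klaus": ["sees the coffee machine", "hears music"]}
action = agent.step(world_state)
print(action)  # react to: sees the coffee machine
```

Note that reflection is gated by a threshold rather than running every step, mirroring the conditional trigger described above.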
Unity ML-Agents
Unity ML-Agents Toolkit is an open-source framework for training and deploying agents within the Unity game engine.
Architecture
```mermaid
graph LR
    subgraph UE["Unity Environment"]
        A[Agent] --> B[Sensors<br/>Visual / Ray / Vector]
        A --> C[Actions<br/>Discrete / Continuous]
        A --> D[Rewards<br/>Reward Signal]
    end
    subgraph PT["Python Training"]
        E[Trainer<br/>PPO / SAC / MA-POCA]
        F[TensorBoard<br/>Visualization]
    end
    B --> E
    E --> C
    D --> E
    E --> F
```
Key Features
| Feature | Description |
|---|---|
| Sensor types | Vector observation, visual observation (camera), ray perception |
| Action types | Discrete, continuous, hybrid actions |
| Training algorithms | PPO, SAC, MA-POCA (multi-agent) |
| Inference mode | ONNX model export, runs directly in Unity |
| Curriculum learning | Supports automatic difficulty adjustment |
Typical Application
```csharp
// Unity ML-Agents agent example
using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using Unity.MLAgents.Sensors;

public class NavigationAgent : Agent
{
    public Transform target;    // navigation goal
    public float speed = 10f;   // movement force multiplier
    private Rigidbody rb;

    public override void Initialize()
    {
        rb = GetComponent<Rigidbody>();
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        // Collect observations: position, velocity, target direction
        sensor.AddObservation(transform.localPosition);
        sensor.AddObservation(rb.velocity);
        sensor.AddObservation(target.localPosition - transform.localPosition);
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        // Execute actions: apply a movement force on the XZ plane
        float moveX = actions.ContinuousActions[0];
        float moveZ = actions.ContinuousActions[1];
        rb.AddForce(new Vector3(moveX, 0, moveZ) * speed);

        // Reward: +1 and end the episode when close enough to the target
        float distance = Vector3.Distance(transform.localPosition,
                                          target.localPosition);
        if (distance < 1.42f)
        {
            SetReward(1.0f);
            EndEpisode();
        }
    }
}
```
Unreal Engine + AI
Unreal Engine provides high-fidelity 3D environments suitable for embodied agent research requiring realistic visuals.
Key Components
- AI Controller: Core class controlling NPC behavior
- Behavior Tree: Built-in behavior tree system
- Environment Query System (EQS): Environmental perception queries
- Navigation Mesh (NavMesh): Automatic pathfinding
- Perception System: Visual/auditory perception simulation
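Behavior trees, the backbone of Unreal's NPC AI, compose conditions and actions with selector ("try until one succeeds") and sequence ("run until one fails") nodes. The following is a language-agnostic sketch in Python, not Unreal's API; all class names are invented for illustration.

```python
SUCCESS, FAILURE = "success", "failure"

class Sequence:
    """Runs children in order; fails on the first failure."""
    def __init__(self, *children): self.children = children
    def tick(self, ctx):
        for child in self.children:
            if child.tick(ctx) == FAILURE:
                return FAILURE
        return SUCCESS

class Selector:
    """Tries children in order; succeeds on the first success."""
    def __init__(self, *children): self.children = children
    def tick(self, ctx):
        for child in self.children:
            if child.tick(ctx) == SUCCESS:
                return SUCCESS
        return FAILURE

class Condition:
    def __init__(self, pred): self.pred = pred
    def tick(self, ctx): return SUCCESS if self.pred(ctx) else FAILURE

class Action:
    def __init__(self, fn): self.fn = fn
    def tick(self, ctx): self.fn(ctx); return SUCCESS

# Guard NPC: attack if the player is visible, otherwise patrol
tree = Selector(
    Sequence(
        Condition(lambda ctx: ctx["player_visible"]),
        Action(lambda ctx: ctx.setdefault("log", []).append("attack")),
    ),
    Action(lambda ctx: ctx.setdefault("log", []).append("patrol")),
)

ctx = {"player_visible": False}
tree.tick(ctx)
print(ctx["log"])  # ['patrol']
```

Unreal's Behavior Tree editor builds the same structure visually, with EQS queries and NavMesh moves as the leaf tasks.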
NVIDIA ACE Integration
NVIDIA Avatar Cloud Engine (ACE) integrates with Unreal Engine to provide:
- Audio2Face: Voice-driven facial animation
- Riva ASR/TTS: Speech recognition and synthesis
- NeMo LLM: Dialogue generation
- Omniverse: Physics simulation
PettingZoo Multi-Agent Environments
PettingZoo is the standard API library for multi-agent reinforcement learning:
```python
from pettingzoo.classic import chess_v6

# Create environment
env = chess_v6.env()
env.reset(seed=42)

# AEC (Agent Environment Cycle) API
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None
    else:
        # Sample a random legal move (replace with a trained policy)
        action = env.action_space(agent).sample(observation["action_mask"])
    env.step(action)
env.close()
```
Environment Categories
| Category | Examples | Characteristics |
|---|---|---|
| Classic | Go, chess, poker | Perfect/imperfect-information turn-based games |
| Atari | Pong, Space Invaders | Pixel observations, multiplayer |
| Butterfly | Pistonball, Knights Archers Zombies | Cooperative, visual tasks |
| MPE | Simple Tag, Simple Speaker-Listener | Continuous 2D space, communication |
| SISL | Multiwalker, Pursuit, Waterworld | Cooperative continuous control |
Environment Design Principles
Observation Space Design
What an agent can perceive determines what it can do:
- Visual observation: Rendered images or structured scene descriptions
- Spatial observation: Position, distance, direction
- Social observation: State and behavior of other agents
- Internal observation: Own state (hunger, fatigue, mood)
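The four categories can be grouped into one structured observation that serializes cleanly into an LLM prompt. This is a sketch; every field name here is a hypothetical choice, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class AgentObservation:
    # Visual: a structured scene description (could also be an image tensor)
    scene_description: str
    # Spatial: own position and distance to a point of interest
    position: tuple[float, float]
    target_distance: float
    # Social: visible agents and their current actions
    visible_agents: dict[str, str] = field(default_factory=dict)
    # Internal: own state
    hunger: float = 0.0
    fatigue: float = 0.0

    def to_prompt(self) -> str:
        """Serialize for an LLM-driven agent's context window."""
        others = ", ".join(f"{a} is {act}"
                           for a, act in self.visible_agents.items()) or "nobody"
        return (f"You are at {self.position}. {self.scene_description} "
                f"Nearby: {others}. Hunger {self.hunger:.1f}, "
                f"fatigue {self.fatigue:.1f}.")

obs = AgentObservation("You are in Hobbs Cafe.", (3.0, 7.0), 2.5,
                       {"Isabella": "making coffee"}, hunger=0.6)
print(obs.to_prompt())
```

Keeping the observation as a typed structure, and rendering it to text only at the prompt boundary, lets the same environment serve both RL and LLM agents.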
Action Space Design
- Movement actions: Navigation, pathfinding
- Interaction actions: Using objects, manipulating the environment
- Communication actions: Language, non-verbal signals
Reward Design
For LLM-driven agents, traditional numerical rewards are often replaced or supplemented by natural language feedback:
| Paradigm | Signal Form | Use Case |
|---|---|---|
| RL reward | \(r \in \mathbb{R}\) | Training phase |
| Language feedback | Natural language evaluation | LLM agents |
| Social feedback | Other agents' reactions | Social simulation |
| Intrinsic motivation | Curiosity / novelty | Exploration-driven |
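The scalar and language paradigms can coexist behind one interface, so the same environment step feeds both an RL trainer and an LLM agent. A minimal sketch; the grading rule below is an arbitrary stand-in for an LLM or human evaluator.

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    """Unified feedback: a scalar for RL training, text for an LLM agent."""
    reward: float   # r in R, consumed by an RL trainer
    language: str   # natural-language evaluation, consumed by an LLM agent

def evaluate_step(distance_before: float, distance_after: float) -> Feedback:
    """Toy evaluator for a single navigation step."""
    progress = distance_before - distance_after
    if progress > 0:
        return Feedback(
            reward=progress,
            language=f"Good: you moved {progress:.1f}m closer to the goal.")
    return Feedback(
        reward=-0.1,
        language="You moved away from the goal; reconsider your route.")

fb = evaluate_step(distance_before=5.0, distance_after=3.5)
print(fb.reward)    # 1.5
print(fb.language)
```

The design choice is that the evaluator, not the agent, decides which channel to populate, so switching an agent from RL to LLM control requires no environment changes.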
Simulation Engine Comparison
| Engine | Dimension | Physics | LLM Integration | Multi-Agent | Open Source | Use Case |
|---|---|---|---|---|---|---|
| Smallville | 2D | None | Native | 25 agents | Yes | Social simulation research |
| Unity ML-Agents | 3D | Yes | Extensible | Supported | Yes | Game AI / General |
| Unreal Engine | 3D | High-fidelity | Via ACE | Supported | Partial | AAA games / High-fidelity |
| PettingZoo | 2D/Abstract | None | Extensible | Native | Yes | MARL research |
| AI Habitat | 3D | Yes | Extensible | Supported | Yes | Embodied navigation |
| Minecraft | 3D | Yes | Via API | Supported | No | Open world exploration |
Performance and Scalability
Simulation Speed
Simulation speed is a key bottleneck, especially when each agent requires an LLM call per step:
Optimization strategies:
- Asynchronous LLM calls: Process multiple agents' LLM requests in parallel
- Caching: Use cached LLM responses for similar situations
- Hierarchical timesteps: Different decision levels use different frequencies
- Selective updates: Only agents with state changes trigger LLM calls
Scalability Challenges
For \(N = 25\) agents running a 2-day simulation (~2880 steps), Park et al. reported thousands of dollars in API costs.
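The order of magnitude follows from simple arithmetic. All token counts and prices below are assumptions for illustration, not figures from the paper:

```python
n_agents = 25
n_steps = 2880                # ~2 simulated days at 1-minute timesteps
calls_per_agent_step = 1      # assumption: at least one LLM call per step
tokens_per_call = 2000        # assumption: prompt plus completion
price_per_1k_tokens = 0.03    # assumption: 2023-era GPT-4-class pricing, USD

total_calls = n_agents * n_steps * calls_per_agent_step
total_tokens = total_calls * tokens_per_call
cost = total_tokens / 1000 * price_per_1k_tokens
print(f"{total_calls} calls, ~${cost:,.0f}")  # 72000 calls, ~$4,320
```

In practice each step triggers several calls (perception summaries, planning, dialogue), so real costs can be a small multiple of this estimate, which is consistent with the "thousands of dollars" figure.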
Summary
Choosing a simulation engine requires balancing:
- Research goals: Social simulation favors Smallville-type; embodied manipulation favors Unity/Unreal
- Fidelity requirements: High fidelity favors Unreal; rapid iteration favors 2D environments
- Agent scale: Large scale favors lightweight frameworks like PettingZoo
- LLM integration: Social simulation has native support; others require custom integration