
Virtual World Simulation Engines

Overview

Virtual world simulation engines provide the operating environment for virtual embodied agents. From simple 2D grid worlds to complex 3D physics simulations, different engines suit different research and application scenarios.

Smallville Architecture

Stanford's Generative Agents project (Park et al., 2023) used a 2D tile-based world called Smallville as its simulation environment.

World Structure

graph TD
    subgraph Smallville 2D Tile World
        A[World Map<br/>Grid Map] --> B[Zones]
        B --> C1[Residential Area<br/>Lin House / Moreno House / ...]
        B --> C2[Commercial Area<br/>Pharmacy / Cafe / ...]
        B --> C3[Public Area<br/>Park / School / ...]

        C1 --> D1[Rooms: Bedroom / Kitchen / Living Room]
        D1 --> E1[Objects: Bed / Refrigerator / Sofa]
    end

    subgraph Agent Loop
        F[Perceive] --> G[Retrieve]
        G --> H[Plan]
        H --> I[Reflect]
        I --> J[Act]
        J --> F
    end

    A -.-> F
    J -.-> A

Environment Tree Structure

Smallville's environment is organized as a tree structure:

Smallville
├── Lin Family House
│   ├── Bedroom
│   │   ├── Bed (sleeping, making bed)
│   │   ├── Desk (writing, reading)
│   │   └── Closet (getting dressed)
│   ├── Kitchen
│   │   ├── Stove (cooking)
│   │   ├── Refrigerator (getting food)
│   │   └── Table (eating)
│   └── Living Room
│       ├── Sofa (relaxing, chatting)
│       └── TV (watching)
├── Hobbs Cafe
│   ├── Counter (ordering)
│   ├── Tables (eating, socializing)
│   └── Kitchen (preparing food)
└── ...

Each object (leaf node) carries a set of affordances; agents can only perform actions supported by the object.
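This affordance-gated tree can be sketched in a few lines of Python; the class names and affordance strings below are illustrative, not Smallville's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class WorldObject:
    name: str
    affordances: set          # actions this object supports

@dataclass
class Area:
    name: str
    children: list = field(default_factory=list)   # sub-areas (rooms, ...)
    objects: list = field(default_factory=list)    # leaf objects

def can_perform(obj, action):
    # Agents may only take actions the object affords
    return action in obj.affordances

# A fragment of the tree above
bed = WorldObject("Bed", {"sleeping", "making bed"})
bedroom = Area("Bedroom", objects=[bed])
house = Area("Lin Family House", children=[bedroom])
```

Planning then reduces to a search over this tree: an agent picks an area, descends to an object, and selects only among that object's affordances.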

Agent Loop

At each simulation timestep (typically 1 minute), each agent executes the following loop:

\[\text{Agent Step} = \text{Perceive}(E_t) \rightarrow \text{Retrieve}(M) \rightarrow \text{Plan}(P) \rightarrow \text{Act}(A) \rightarrow E_{t+1}\]
  1. Perceive: Obtain environmental state and other agents within the field of view
  2. Retrieve: Retrieve relevant memories from the memory stream
  3. Plan: Generate or update the action plan
  4. Act: Execute the current action from the plan
  5. Reflect: Conditionally triggered higher-level thinking
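A condensed sketch of this loop, with retrieval reduced to importance ranking and reflection triggered by an accumulated-importance threshold; all thresholds and method names are illustrative stand-ins for the paper's mechanisms:

```python
class Agent:
    def __init__(self, reflection_threshold=5):
        self.memory = []               # list of (text, importance) pairs
        self.reflections = []
        self.reflection_threshold = reflection_threshold
        self._acc_importance = 0

    def perceive(self, events):
        # 1. Perceive: record observed events in the memory stream
        for text, importance in events:
            self.memory.append((text, importance))
            self._acc_importance += importance

    def retrieve(self, k=3):
        # 2. Retrieve: most important memories first
        # (the real system also scores recency and relevance)
        return sorted(self.memory, key=lambda m: -m[1])[:k]

    def step(self, events):
        self.perceive(events)
        relevant = self.retrieve()
        # 3-4. Plan/Act: stubbed as "act on the top memory"
        action = relevant[0][0] if relevant else "idle"
        # 5. Reflect: only when accumulated importance crosses the threshold
        if self._acc_importance >= self.reflection_threshold:
            self.reflections.append(f"reflected on {len(self.memory)} memories")
            self._acc_importance = 0
        return action
```

The key design point survives even in this toy version: reflection is not run every step but gated on how much important material has accumulated, which keeps expensive higher-level reasoning rare.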

Unity ML-Agents

Unity ML-Agents Toolkit is an open-source framework for training and deploying agents within the Unity game engine.

Architecture

graph LR
    subgraph Unity Environment
        A[Agent] --> B[Sensors<br/>Visual / Ray / Vector]
        A --> C[Actions<br/>Discrete / Continuous]
        A --> D[Rewards<br/>Reward Signal]
    end

    subgraph Python Training
        E[Trainer<br/>PPO / SAC / MA-POCA]
        F[TensorBoard<br/>Visualization]
    end

    B --> E
    E --> C
    D --> E
    E --> F

Key Features

| Feature | Description |
| --- | --- |
| Sensor types | Vector observation, visual observation (camera), ray perception |
| Action types | Discrete, continuous, hybrid actions |
| Training algorithms | PPO, SAC, MA-POCA (multi-agent) |
| Inference mode | ONNX model export, runs directly in Unity |
| Curriculum learning | Supports automatic difficulty adjustment |

Typical Application

// Unity ML-Agents agent example
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using Unity.MLAgents.Actuators;
using UnityEngine;

public class NavigationAgent : Agent
{
    public Transform target;     // navigation goal
    public float speed = 10f;    // force multiplier
    private Rigidbody rb;

    public override void Initialize()
    {
        rb = GetComponent<Rigidbody>();
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        // Collect observations: position, velocity, direction to target
        sensor.AddObservation(transform.localPosition);
        sensor.AddObservation(rb.velocity);
        sensor.AddObservation(target.localPosition - transform.localPosition);
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        // Execute actions: apply continuous forces on the X and Z axes
        float moveX = actions.ContinuousActions[0];
        float moveZ = actions.ContinuousActions[1];
        rb.AddForce(new Vector3(moveX, 0f, moveZ) * speed);

        // Reward and end the episode once the agent reaches the target
        float distance = Vector3.Distance(transform.localPosition,
                                          target.localPosition);
        if (distance < 1.42f)
        {
            SetReward(1.0f);
            EndEpisode();
        }
    }
}

Unreal Engine + AI

Unreal Engine provides high-fidelity 3D environments suitable for embodied agent research requiring realistic visuals.

Key Components

  • AI Controller: Core class controlling NPC behavior
  • Behavior Tree: Built-in behavior tree system
  • Environment Query System (EQS): Environmental perception queries
  • Navigation Mesh (NavMesh): Automatic pathfinding
  • Perception System: Visual/auditory perception simulation

NVIDIA ACE Integration

NVIDIA Avatar Cloud Engine (ACE) integrates with Unreal Engine to provide:

  • Audio2Face: Voice-driven facial animation
  • Riva ASR/TTS: Speech recognition and synthesis
  • NeMo LLM: Dialogue generation
  • Omniverse: Physics simulation

PettingZoo Multi-Agent Environments

PettingZoo is the standard API library for multi-agent reinforcement learning:

from pettingzoo.classic import chess_v6

# Create environment
env = chess_v6.env()
env.reset()

# AEC (Agent Environment Cycle) API
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()

    if termination or truncation:
        action = None
    else:
        # Sample a random legal move using the action mask;
        # substitute a learned or LLM-driven policy here in practice
        mask = observation["action_mask"]
        action = env.action_space(agent).sample(mask)

    env.step(action)

env.close()

Environment Categories

| Category | Examples | Characteristics |
| --- | --- | --- |
| Classic | Go, chess, poker | Complete/incomplete information games |
| Atari | Pong, Space Invaders | Pixel observations, multiplayer |
| Butterfly | Cooperative pursuit | Cooperative tasks |
| MPE | Simple tag, communication | Continuous space, communication |
| SISL | Multiwalker, Pursuit, Waterworld | Cooperative control tasks |

Environment Design Principles

Observation Space Design

What an agent can perceive determines what it can do:

\[\mathcal{O} = \{o_{\text{visual}}, o_{\text{spatial}}, o_{\text{social}}, o_{\text{internal}}\}\]
  • Visual observation: Rendered images or structured scene descriptions
  • Spatial observation: Position, distance, direction
  • Social observation: State and behavior of other agents
  • Internal observation: Own state (hunger, fatigue, mood)
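The four channels can be composed into one structured observation. A minimal sketch over plain dicts, with field names invented for illustration:

```python
import math

def build_observation(agent, world):
    """Compose O = {visual, spatial, social, internal} from plain dicts."""
    x, y = agent["position"]
    tx, ty = world["target"]
    return {
        "visual": world["scene_description"],              # image or text render
        "spatial": {"position": (x, y),
                    "target_distance": math.hypot(tx - x, ty - y)},
        "social": [a["state"] for a in world["other_agents"]],
        "internal": {"hunger": agent["hunger"], "mood": agent["mood"]},
    }

obs = build_observation(
    {"position": (0, 0), "hunger": 0.2, "mood": "calm"},
    {"target": (3, 4), "scene_description": "a quiet cafe",
     "other_agents": [{"state": "chatting"}]},
)
```

For an LLM-driven agent, a dict like this is typically serialized into the prompt; for an RL agent, the same fields would be flattened into a vector.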

Action Space Design

\[\mathcal{A} = \mathcal{A}_{\text{movement}} \times \mathcal{A}_{\text{interaction}} \times \mathcal{A}_{\text{communication}}\]
  • Movement actions: Navigation, pathfinding
  • Interaction actions: Using objects, manipulating the environment
  • Communication actions: Language, non-verbal signals
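Because the action space is a Cartesian product, the joint action set can be enumerated directly. A small sketch with illustrative action names:

```python
from itertools import product

# Factored action space: A = A_movement x A_interaction x A_communication
MOVEMENT = ["stay", "north", "south", "east", "west"]
INTERACTION = ["none", "use_object", "pick_up"]
COMMUNICATION = ["silent", "speak", "gesture"]

def joint_action_space():
    """Enumerate every (movement, interaction, communication) tuple."""
    return list(product(MOVEMENT, INTERACTION, COMMUNICATION))
```

The multiplicative growth (here 5 × 3 × 3 = 45 joint actions) is one reason factored or hierarchical action selection is common: agents often choose each component separately rather than over the full product.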

Reward Design

For LLM-driven agents, traditional numerical rewards are replaced by natural language feedback:

| Paradigm | Signal Form | Use Case |
| --- | --- | --- |
| RL reward | \(r \in \mathbb{R}\) | Training phase |
| Language feedback | Natural language evaluation | LLM agents |
| Social feedback | Other agents' reactions | Social simulation |
| Intrinsic motivation | Curiosity / novelty | Exploration-driven |
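The four paradigms can be contrasted in code. A toy sketch where `outcome` is an invented dict, not any framework's schema:

```python
def feedback(paradigm, outcome):
    """Map one step's outcome to a training signal under each paradigm."""
    if paradigm == "rl":                 # scalar reward r in R
        return 1.0 if outcome["goal_reached"] else -0.01
    if paradigm == "language":           # natural-language evaluation
        return ("The plan worked; the goal was reached."
                if outcome["goal_reached"]
                else f"Still {outcome['distance']}m away; consider replanning.")
    if paradigm == "social":             # other agents' reactions, verbatim
        return outcome["reactions"]
    if paradigm == "intrinsic":          # curiosity: novelty of the state
        return 1.0 / (1.0 + outcome["visit_count"])
    raise ValueError(paradigm)
```

Note the asymmetry: the RL and intrinsic signals are differentiable-friendly scalars, while the language and social signals are only usable by an agent that can interpret text, which is exactly what LLM-driven agents provide.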

Simulation Engine Comparison

| Engine | Dimension | Physics | LLM Integration | Multi-Agent | Open Source | Use Case |
| --- | --- | --- | --- | --- | --- | --- |
| Smallville | 2D | None | Native | 25 agents | Yes | Social simulation research |
| Unity ML-Agents | 3D | Yes | Extensible | Supported | Yes | Game AI / general |
| Unreal Engine | 3D | High-fidelity | Via ACE | Supported | Partial | AAA games / high fidelity |
| PettingZoo | 2D / abstract | None | Extensible | Native | Yes | MARL research |
| AI Habitat | 3D | Yes | Extensible | Supported | Yes | Embodied navigation |
| Minecraft | 3D | Yes | Via API | Supported | No | Open-world exploration |

Performance and Scalability

Simulation Speed

Simulation speed is a key bottleneck, especially when each agent requires an LLM call per step:

\[T_{\text{step}} = \max_{i \in \text{agents}} \left( T_{\text{perceive}}^i + T_{\text{LLM}}^i + T_{\text{act}}^i \right)\]

Optimization strategies:

  1. Asynchronous LLM calls: Process multiple agents' LLM requests in parallel
  2. Caching: Use cached LLM responses for similar situations
  3. Hierarchical timesteps: Different decision levels use different frequencies
  4. Selective updates: Only agents with state changes trigger LLM calls

Scalability Challenges

\[\text{Cost} = N_{\text{agents}} \times K_{\text{LLM calls/step}} \times C_{\text{per call}} \times T_{\text{total steps}}\]

For \(N = 25\) agents running a 2-day simulation (~2880 steps), Park et al. reported thousands of dollars in API costs.
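Plugging illustrative numbers into the cost formula shows how quickly this adds up; the per-call price and calls-per-step below are assumptions, not the paper's actual figures:

```python
# Back-of-the-envelope cost for a Smallville-scale run
n_agents = 25
steps = 2880            # ~2 simulated days at 1-minute timesteps
calls_per_step = 2      # e.g. planning + dialogue (assumed)
cost_per_call = 0.02    # USD per LLM call (assumed)

total = n_agents * steps * calls_per_step * cost_per_call
print(f"${total:,.0f}")  # → $2,880
```

Because cost scales linearly in every factor, doubling the agent count or halving the timestep doubles the bill, which is why the optimization strategies above target the call count rather than the agent count.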

Summary

Choosing a simulation engine requires balancing:

  • Research goals: Social simulation favors Smallville-type; embodied manipulation favors Unity/Unreal
  • Fidelity requirements: High fidelity favors Unreal; rapid iteration favors 2D environments
  • Agent scale: Large scale favors lightweight frameworks like PettingZoo
  • LLM integration: Social simulation has native support; others require custom integration
