
Virtual World Simulation Engines

Overview

Virtual world simulation engines provide the operating environment for virtual embodied agents. From simple 2D grid worlds to complex 3D physics simulations, different engines suit different research and application scenarios.

Smallville Architecture

Stanford's Generative Agents project (Park et al., 2023) used a 2D tile-based world called Smallville as its simulation environment.

World Structure

graph TD
    subgraph Smallville 2D Tile World
        A[World Map<br/>Grid Map] --> B[Zones]
        B --> C1[Residential Area<br/>Lin House / Moreno House / ...]
        B --> C2[Commercial Area<br/>Pharmacy / Cafe / ...]
        B --> C3[Public Area<br/>Park / School / ...]

        C1 --> D1[Rooms: Bedroom / Kitchen / Living Room]
        D1 --> E1[Objects: Bed / Refrigerator / Sofa]
    end

    subgraph Agent Loop
        F[Perceive] --> G[Retrieve]
        G --> H[Plan]
        H --> I[Reflect]
        I --> J[Act]
        J --> F
    end

    A -.-> F
    J -.-> A

Environment Tree Structure

Smallville's environment is organized as a tree structure:

Smallville
├── Lin Family House
│   ├── Bedroom
│   │   ├── Bed (sleeping, making bed)
│   │   ├── Desk (writing, reading)
│   │   └── Closet (getting dressed)
│   ├── Kitchen
│   │   ├── Stove (cooking)
│   │   ├── Refrigerator (getting food)
│   │   └── Table (eating)
│   └── Living Room
│       ├── Sofa (relaxing, chatting)
│       └── TV (watching)
├── Hobbs Cafe
│   ├── Counter (ordering)
│   ├── Tables (eating, socializing)
│   └── Kitchen (preparing food)
└── ...

Each object (leaf node) carries a set of affordances; agents can only perform actions supported by the object.
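This affordance-gated tree can be sketched in a few lines of Python; the class names and affordance strings below are illustrative, not Smallville's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class WorldObject:
    name: str
    affordances: set          # actions this object supports

@dataclass
class Area:
    name: str
    children: list = field(default_factory=list)   # sub-areas (rooms, ...)
    objects: list = field(default_factory=list)    # leaf objects

def can_perform(obj, action):
    # Agents may only take actions the object affords
    return action in obj.affordances

# A fragment of the tree above
bed = WorldObject("Bed", {"sleeping", "making bed"})
bedroom = Area("Bedroom", objects=[bed])
house = Area("Lin Family House", children=[bedroom])
```

Planning then reduces to a search over this tree: an agent picks an area, descends to an object, and selects only among that object's affordances.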

Agent Loop

At each simulation timestep (typically 1 minute), each agent executes the following loop:

\[\text{Agent Step} = \text{Perceive}(E_t) \rightarrow \text{Retrieve}(M) \rightarrow \text{Plan}(P) \rightarrow \text{Act}(A) \rightarrow E_{t+1}\]
  1. Perceive: Obtain environmental state and other agents within the field of view
  2. Retrieve: Retrieve relevant memories from the memory stream
  3. Plan: Generate or update the action plan
  4. Act: Execute the current action from the plan
  5. Reflect: Conditionally triggered higher-level thinking
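A condensed sketch of this loop, with retrieval reduced to importance ranking and reflection triggered by an accumulated-importance threshold; all thresholds and method names are illustrative stand-ins for the paper's mechanisms:

```python
class Agent:
    def __init__(self, reflection_threshold=5):
        self.memory = []               # list of (text, importance) pairs
        self.reflections = []
        self.reflection_threshold = reflection_threshold
        self._acc_importance = 0

    def perceive(self, events):
        # 1. Perceive: record observed events in the memory stream
        for text, importance in events:
            self.memory.append((text, importance))
            self._acc_importance += importance

    def retrieve(self, k=3):
        # 2. Retrieve: most important memories first
        # (the real system also scores recency and relevance)
        return sorted(self.memory, key=lambda m: -m[1])[:k]

    def step(self, events):
        self.perceive(events)
        relevant = self.retrieve()
        # 3-4. Plan/Act: stubbed as "act on the top memory"
        action = relevant[0][0] if relevant else "idle"
        # 5. Reflect: only when accumulated importance crosses the threshold
        if self._acc_importance >= self.reflection_threshold:
            self.reflections.append(f"reflected on {len(self.memory)} memories")
            self._acc_importance = 0
        return action
```

The key design point survives even in this toy version: reflection is not run every step but gated on how much important material has accumulated, which keeps expensive higher-level reasoning rare.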

Unity ML-Agents

Unity ML-Agents Toolkit is an open-source framework for training and deploying agents within the Unity game engine.

Architecture

graph LR
    subgraph Unity Environment
        A[Agent] --> B[Sensors<br/>Visual / Ray / Vector]
        A --> C[Actions<br/>Discrete / Continuous]
        A --> D[Rewards<br/>Reward Signal]
    end

    subgraph Python Training
        E[Trainer<br/>PPO / SAC / MA-POCA]
        F[TensorBoard<br/>Visualization]
    end

    B --> E
    E --> C
    D --> E
    E --> F

Key Features

| Feature | Description |
| --- | --- |
| Sensor types | Vector observation, visual observation (camera), ray perception |
| Action types | Discrete, continuous, hybrid actions |
| Training algorithms | PPO, SAC, MA-POCA (multi-agent) |
| Inference mode | ONNX model export, runs directly in Unity |
| Curriculum learning | Supports automatic difficulty adjustment |

Typical Application

// Unity ML-Agents agent example
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using Unity.MLAgents.Actuators;
using UnityEngine;

public class NavigationAgent : Agent
{
    public Transform target;     // navigation goal
    public float speed = 10f;    // force multiplier
    private Rigidbody rb;

    public override void Initialize()
    {
        rb = GetComponent<Rigidbody>();
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        // Collect observations: position, velocity, direction to target
        sensor.AddObservation(transform.localPosition);
        sensor.AddObservation(rb.velocity);
        sensor.AddObservation(target.localPosition - transform.localPosition);
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        // Execute actions: apply continuous forces on the X and Z axes
        float moveX = actions.ContinuousActions[0];
        float moveZ = actions.ContinuousActions[1];
        rb.AddForce(new Vector3(moveX, 0f, moveZ) * speed);

        // Reward and end the episode once the agent reaches the target
        float distance = Vector3.Distance(transform.localPosition,
                                          target.localPosition);
        if (distance < 1.42f)
        {
            SetReward(1.0f);
            EndEpisode();
        }
    }
}

Unreal Engine + AI

Unreal Engine provides high-fidelity 3D environments suitable for embodied agent research requiring realistic visuals.

Key Components

  • AI Controller: Core class controlling NPC behavior
  • Behavior Tree: Built-in behavior tree system
  • Environment Query System (EQS): Environmental perception queries
  • Navigation Mesh (NavMesh): Automatic pathfinding
  • Perception System: Visual/auditory perception simulation

NVIDIA ACE Integration

NVIDIA Avatar Cloud Engine (ACE) integrates with Unreal Engine to provide:

  • Audio2Face: Voice-driven facial animation
  • Riva ASR/TTS: Speech recognition and synthesis
  • NeMo LLM: Dialogue generation
  • Omniverse: Physics simulation

PettingZoo Multi-Agent Environments

PettingZoo is the standard API library for multi-agent reinforcement learning:

from pettingzoo.classic import chess_v6

# Create environment
env = chess_v6.env()
env.reset()

# AEC (Agent Environment Cycle) API
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()

    if termination or truncation:
        action = None
    else:
        # Sample a random legal move using the action mask;
        # substitute a learned or LLM-driven policy here in practice
        mask = observation["action_mask"]
        action = env.action_space(agent).sample(mask)

    env.step(action)

env.close()

Environment Categories

| Category | Examples | Characteristics |
| --- | --- | --- |
| Classic | Go, chess, poker | Complete/incomplete information games |
| Atari | Pong, Space Invaders | Pixel observations, multiplayer |
| Butterfly | Cooperative pursuit | Cooperative tasks |
| MPE | Simple tag, communication | Continuous space, communication |
| SISL | Multiwalker, Pursuit, Waterworld | Cooperative control tasks |

Environment Design Principles

Observation Space Design

What an agent can perceive determines what it can do:

\[\mathcal{O} = \{o_{\text{visual}}, o_{\text{spatial}}, o_{\text{social}}, o_{\text{internal}}\}\]
  • Visual observation: Rendered images or structured scene descriptions
  • Spatial observation: Position, distance, direction
  • Social observation: State and behavior of other agents
  • Internal observation: Own state (hunger, fatigue, mood)
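The four channels can be composed into one structured observation. A minimal sketch over plain dicts, with field names invented for illustration:

```python
import math

def build_observation(agent, world):
    """Compose O = {visual, spatial, social, internal} from plain dicts."""
    x, y = agent["position"]
    tx, ty = world["target"]
    return {
        "visual": world["scene_description"],              # image or text render
        "spatial": {"position": (x, y),
                    "target_distance": math.hypot(tx - x, ty - y)},
        "social": [a["state"] for a in world["other_agents"]],
        "internal": {"hunger": agent["hunger"], "mood": agent["mood"]},
    }

obs = build_observation(
    {"position": (0, 0), "hunger": 0.2, "mood": "calm"},
    {"target": (3, 4), "scene_description": "a quiet cafe",
     "other_agents": [{"state": "chatting"}]},
)
```

For an LLM-driven agent, a dict like this is typically serialized into the prompt; for an RL agent, the same fields would be flattened into a vector.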

Action Space Design

\[\mathcal{A} = \mathcal{A}_{\text{movement}} \times \mathcal{A}_{\text{interaction}} \times \mathcal{A}_{\text{communication}}\]
  • Movement actions: Navigation, pathfinding
  • Interaction actions: Using objects, manipulating the environment
  • Communication actions: Language, non-verbal signals
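Because the action space is a Cartesian product, the joint action set can be enumerated directly. A small sketch with illustrative action names:

```python
from itertools import product

# Factored action space: A = A_movement x A_interaction x A_communication
MOVEMENT = ["stay", "north", "south", "east", "west"]
INTERACTION = ["none", "use_object", "pick_up"]
COMMUNICATION = ["silent", "speak", "gesture"]

def joint_action_space():
    """Enumerate every (movement, interaction, communication) tuple."""
    return list(product(MOVEMENT, INTERACTION, COMMUNICATION))
```

The multiplicative growth (here 5 × 3 × 3 = 45 joint actions) is one reason factored or hierarchical action selection is common: agents often choose each component separately rather than over the full product.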

Reward Design

For LLM-driven agents, traditional numerical rewards are replaced by natural language feedback:

| Paradigm | Signal Form | Use Case |
| --- | --- | --- |
| RL reward | \(r \in \mathbb{R}\) | Training phase |
| Language feedback | Natural language evaluation | LLM agents |
| Social feedback | Other agents' reactions | Social simulation |
| Intrinsic motivation | Curiosity / novelty | Exploration-driven |
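The four paradigms can be contrasted in code. A toy sketch where `outcome` is an invented dict, not any framework's schema:

```python
def feedback(paradigm, outcome):
    """Map one step's outcome to a training signal under each paradigm."""
    if paradigm == "rl":                 # scalar reward r in R
        return 1.0 if outcome["goal_reached"] else -0.01
    if paradigm == "language":           # natural-language evaluation
        return ("The plan worked; the goal was reached."
                if outcome["goal_reached"]
                else f"Still {outcome['distance']}m away; consider replanning.")
    if paradigm == "social":             # other agents' reactions, verbatim
        return outcome["reactions"]
    if paradigm == "intrinsic":          # curiosity: novelty of the state
        return 1.0 / (1.0 + outcome["visit_count"])
    raise ValueError(paradigm)
```

Note the asymmetry: the RL and intrinsic signals are differentiable-friendly scalars, while the language and social signals are only usable by an agent that can interpret text, which is exactly what LLM-driven agents provide.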

Simulation Engine Comparison

| Engine | Dimension | Physics | LLM Integration | Multi-Agent | Open Source | Use Case |
| --- | --- | --- | --- | --- | --- | --- |
| Smallville | 2D | None | Native | 25 agents | Yes | Social simulation research |
| Unity ML-Agents | 3D | Yes | Extensible | Supported | Yes | Game AI / general |
| Unreal Engine | 3D | High-fidelity | Via ACE | Supported | Partial | AAA games / high fidelity |
| PettingZoo | 2D / abstract | None | Extensible | Native | Yes | MARL research |
| AI Habitat | 3D | Yes | Extensible | Supported | Yes | Embodied navigation |
| Minecraft | 3D | Yes | Via API | Supported | No | Open-world exploration |

Performance and Scalability

Simulation Speed

Simulation speed is a key bottleneck, especially when each agent requires an LLM call per step:

\[T_{\text{step}} = \max_{i \in \text{agents}} \left( T_{\text{perceive}}^i + T_{\text{LLM}}^i + T_{\text{act}}^i \right)\]

Optimization strategies:

  1. Asynchronous LLM calls: Process multiple agents' LLM requests in parallel
  2. Caching: Use cached LLM responses for similar situations
  3. Hierarchical timesteps: Different decision levels use different frequencies
  4. Selective updates: Only agents with state changes trigger LLM calls

Scalability Challenges

\[\text{Cost} = N_{\text{agents}} \times K_{\text{LLM calls/step}} \times C_{\text{per call}} \times T_{\text{total steps}}\]

For \(N = 25\) agents running a 2-day simulation (~2880 steps), Park et al. reported thousands of dollars in API costs.
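Plugging illustrative numbers into the cost formula shows how quickly this adds up; the per-call price and calls-per-step below are assumptions, not the paper's actual figures:

```python
# Back-of-the-envelope cost for a Smallville-scale run
n_agents = 25
steps = 2880            # ~2 simulated days at 1-minute timesteps
calls_per_step = 2      # e.g. planning + dialogue (assumed)
cost_per_call = 0.02    # USD per LLM call (assumed)

total = n_agents * steps * calls_per_step * cost_per_call
print(f"${total:,.0f}")  # → $2,880
```

Because cost scales linearly in every factor, doubling the agent count or halving the timestep doubles the bill, which is why the optimization strategies above target the call count rather than the agent count.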

Summary

Choosing a simulation engine requires balancing:

  • Research goals: Social simulation favors Smallville-type; embodied manipulation favors Unity/Unreal
  • Fidelity requirements: High fidelity favors Unreal; rapid iteration favors 2D environments
  • Agent scale: Large scale favors lightweight frameworks like PettingZoo
  • LLM integration: Social simulation has native support; others require custom integration
