
Representations and World Models

Overview

How does an agent internally represent the external world? How can it use these representations for prediction and planning? This article explores core representation problems in embodied intelligence: from predictive coding and the free energy principle to learnable world models, from object-centric representations to spatial representations (NeRF, 3D Gaussian Splatting).


1. Philosophical Foundations of Internal Representation

1.1 Representationalism vs. Anti-Representationalism

This is a fundamental debate in embodied cognition:

| Position | Claim | Representatives |
|---|---|---|
| Representationalism | Intelligence requires internal world models | Marr, Craik |
| Anti-Representationalism | Intelligence can function without representations (reactive) | Brooks, Beer |
| Minimal Representationalism | Representations are needed but should be as parsimonious as possible | Clark |

Modern Consensus: Complex tasks (such as long-horizon manipulation, multi-step planning) require some form of internal representation, but these can be implicit and distributed rather than explicit symbolic representations.

1.2 Craik's Internal Model Hypothesis

Kenneth Craik (1943) proposed:

Organisms construct "small-scale models" of the external world in their brains, which they use to predict events, reason, and plan.

This hypothesis is the philosophical origin of modern world model research.


2. Predictive Coding and the Free Energy Principle

2.1 Predictive Coding

Predictive coding theory posits that the brain's core function is prediction -- continuously predicting the next moment's sensory input and minimizing prediction error.

Hierarchical Predictive Coding:

In a hierarchical structure, each layer \(l\) generates predictions of its lower-layer input and computes prediction errors:

\[\epsilon_l = o_l - g_l(\hat{s}_{l+1})\]

where \(\hat{s}_{l+1}\) is the upper layer's state estimate and \(g_l\) is the generative model. Prediction error \(\epsilon_l\) propagates upward, driving the upper layer to update its state estimate.
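A minimal numerical sketch of this loop, assuming a linear generative model \(g_l(s) = Ws\) and gradient descent of the state estimate on the squared prediction error (both are illustrative stand-ins, not part of the theory itself):

```python
import numpy as np

# Predictive-coding inference for one layer pair: the upper-layer estimate
# s_hat is updated until the prediction W @ s_hat matches the input o.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5],
              [0.2, -0.3]])            # generative weights (stand-in for g_l)
o = np.array([1.0, -0.5, 0.3, 0.8])    # lower-layer input o_l
s_hat = np.zeros(2)                    # upper-layer state estimate

lr = 0.1
for _ in range(200):
    eps = o - W @ s_hat                # prediction error epsilon_l
    s_hat += lr * (W.T @ eps)          # error propagates up, driving the update

print(s_hat)  # settles at the estimate minimizing the prediction error
```

At convergence the remaining error is orthogonal to what the generative model can explain, i.e. the estimate is the least-squares solution.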

2.2 Free Energy Principle

Karl Friston's free energy principle unifies perception, action, and learning:

\[F = D_{KL}[q(s|o) \| p(s)] - \mathbb{E}_{q(s|o)}[\ln p(o|s)]\]

where:

  • \(F\): Variational free energy (the quantity to be minimized)
  • \(q(s|o)\): Approximate posterior belief (the brain's estimate of hidden states)
  • \(p(s)\): Prior belief
  • \(\mathbb{E}_{q(s|o)}[\ln p(o|s)]\): Expected log-likelihood of observations (accuracy)

Decomposition of Free Energy:

\[F = \underbrace{D_{KL}[q(s|o) \| p(s|o)]}_{\text{Posterior approximation error} \geq 0} - \underbrace{\ln p(o)}_{\text{Log model evidence}}\]

Therefore \(F \geq -\ln p(o)\), and minimizing free energy is equivalent to:

  1. Perception (updating \(q\)): Making posterior beliefs more accurate
  2. Action (changing \(o\)): Making observations conform to expectations
  3. Learning (updating the model): Making the generative model more accurate
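The bound \(F \geq -\ln p(o)\) can be checked numerically in a toy two-state world (the priors and likelihoods below are arbitrary illustrative numbers):

```python
import numpy as np

# Two hidden states; fix an observation o and check F >= -ln p(o),
# with equality exactly when q equals the true posterior p(s|o).
p_s = np.array([0.7, 0.3])           # prior p(s)
p_o_given_s = np.array([0.9, 0.2])   # likelihood p(o|s) for the observed o
p_o = np.sum(p_s * p_o_given_s)      # model evidence p(o)

def free_energy(q):
    # F = E_q[ln q(s) - ln p(s)] - E_q[ln p(o|s)]
    return np.sum(q * (np.log(q) - np.log(p_s))) - np.sum(q * np.log(p_o_given_s))

q_bad = np.array([0.5, 0.5])         # an arbitrary belief
q_opt = p_s * p_o_given_s / p_o      # the exact posterior p(s|o)

print(free_energy(q_bad), free_energy(q_opt), -np.log(p_o))
```

Minimizing \(F\) over \(q\) therefore recovers the posterior, which is the "perception" route to free energy minimization.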

2.3 Active Inference

Under the free energy framework, the purpose of action is to minimize expected free energy:

\[a^* = \arg\min_a \mathbb{E}_{q(s'|a)}\left[ F(o', s') \right]\]

The agent selects actions that make future observations conform to its preferences (priors). This unifies perception and action -- both are different aspects of free energy minimization.

Significance for Robotics:

  • Provides a unified perception-action theoretical framework
  • Naturally handles uncertainty and active exploration
  • Explains curiosity-driven exploration behavior

3. World Models in Robotics

3.1 Learned Dynamics Models

The core of a world model is learning the environment's state transition distribution:

\[p_\theta(s_{t+1} \mid s_t, a_t)\]

Deterministic Model:

\[\hat{s}_{t+1} = f_\theta(s_t, a_t)\]

Stochastic Model (more suitable for the real world):

\[s_{t+1} \sim \mathcal{N}(\mu_\theta(s_t, a_t), \Sigma_\theta(s_t, a_t))\]
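A sketch of sampling from such a stochastic model, where `mu_theta` and `sigma_theta` stand in for learned network heads (here a fixed linear map and a constant diagonal covariance, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def mu_theta(s, a):
    return 0.9 * s + a               # stand-in for a learned mean network

def sigma_theta(s, a):
    return np.diag([0.05, 0.05])     # stand-in for a learned covariance head

s = np.array([1.0, -1.0])
a = np.array([0.1, 0.0])

# A stochastic rollout step draws samples rather than a point prediction,
# so repeated rollouts expose the model's uncertainty.
samples = rng.multivariate_normal(mu_theta(s, a), sigma_theta(s, a), size=1000)
print(samples.mean(axis=0))          # close to mu_theta(s, a) = [1.0, -0.9]
```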

3.2 RSSM (Recurrent State Space Model)

The RSSM proposed by Hafner et al. (2019) is one of the most successful world model architectures (used in the Dreamer series):

State Space consists of a deterministic component \(h_t\) and a stochastic component \(z_t\):

\[\begin{aligned} \text{Deterministic path:} \quad & h_t = f_\theta(h_{t-1}, z_{t-1}, a_{t-1}) \\ \text{Prior:} \quad & \hat{z}_t \sim p_\theta(z_t | h_t) \\ \text{Posterior:} \quad & z_t \sim q_\phi(z_t | h_t, o_t) \\ \text{Observation decoder:} \quad & \hat{o}_t \sim p_\theta(o_t | h_t, z_t) \\ \text{Reward prediction:} \quad & \hat{r}_t \sim p_\theta(r_t | h_t, z_t) \end{aligned}\]
```mermaid
flowchart LR
    subgraph RSSM
        A["h_{t-1}, z_{t-1}"] -->|GRU| B["h_t"]
        C["a_{t-1}"] -->|GRU| B
        B -->|Prior Network| D["z_t ~ prior"]
        B -->|Posterior Network| E["z_t ~ posterior"]
        F["o_t"] -->|Encoder| E
        B --> G["Observation Decoder"]
        D --> G
        G --> H["ô_t"]
    end
```

Training Objective:

\[\mathcal{L} = \sum_t \left[ -\ln p_\theta(o_t | h_t, z_t) + \beta \cdot D_{KL}[q_\phi(z_t|h_t,o_t) \| p_\theta(z_t|h_t)] \right]\]
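The KL term of this objective has a closed form when prior and posterior are diagonal Gaussians, as in RSSM; a self-contained sketch (the example means and variances are arbitrary):

```python
import numpy as np

# KL divergence between diagonal Gaussians q = N(mu_q, var_q) and
# p = N(mu_p, var_p), summed over latent dimensions.
def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

mu_q, var_q = np.array([0.5, -0.2]), np.array([0.3, 0.3])  # posterior stats
mu_p, var_p = np.zeros(2), np.ones(2)                      # prior stats
beta = 1.0

kl = beta * kl_diag_gauss(mu_q, var_q, mu_p, var_p)
print(kl)  # non-negative; zero iff posterior equals prior
```

In training, this term pulls the prior toward the posterior so that imagination-time rollouts (which only have the prior) stay consistent with observation-conditioned inference.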

3.3 World Models for Planning

With a world model, planning can be performed in "imagination":

Model Predictive Control (MPC):

\[a_{t:t+H}^* = \arg\min_{a_{t:t+H}} \sum_{k=0}^{H} c(\hat{s}_{t+k}, a_{t+k})\]
\[\text{s.t.} \quad \hat{s}_{t+k+1} = f_\theta(\hat{s}_{t+k}, a_{t+k})\]
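A random-shooting solver is the simplest way to approximate this optimization: sample candidate action sequences, roll each out through the model, keep the best. The dynamics and cost below are toy stand-ins for learned models:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(s, a):                   # toy point-mass dynamics (stand-in)
    return s + a

def cost(s, a):                      # reach the origin with small actions
    return np.sum(s ** 2) + 0.1 * np.sum(a ** 2)

def mpc(s0, horizon=5, n_samples=256):
    best_a0, best_cost = None, np.inf
    for _ in range(n_samples):
        seq = rng.uniform(-1, 1, size=(horizon, 2))   # candidate a_{t:t+H}
        s, total = s0.copy(), 0.0
        for a in seq:
            total += cost(s, a)
            s = f_theta(s, a)        # imagined rollout, no real interaction
        if total < best_cost:
            best_a0, best_cost = seq[0], total
    return best_a0, best_cost        # execute only the first action, replan

s0 = np.array([1.0, -1.0])
a0, c_best = mpc(s0)
```

Executing only the first action and replanning at every step is what makes this *receding-horizon* control; CEM and other samplers refine the same idea.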

Dreamer's Imagination-Based Planning:

Dreamer rolls out imagined trajectories in the learned latent space and trains an actor-critic policy on them, greatly reducing the amount of real environment interaction required.


4. Object-Centric Representations

4.1 Why Object-Centric Representations Are Needed

Traditional holistic representations (such as CNN features) encode the entire scene into a single vector, but:

  • Struggle with compositional generalization (new object combinations)
  • Difficulty reasoning about inter-object relationships
  • Difficulty tracking individual object dynamics

4.2 Slot Attention

Slot Attention, proposed by Locatello et al. (2020), is a representative method for object-centric representations:

Core Idea: Decompose the scene into \(K\) "slots," each representing an object or object part.

Iterative Attention Process:

\[\begin{aligned} M_{ij} &= \frac{k(x_i) \cdot q(s_j)}{\sqrt{d}} \\ \text{attn}_{ij} &= \frac{e^{M_{ij}}}{\sum_l e^{M_{il}}} \quad \text{(slot competition)} \\ \text{updates}_j &= \sum_i \text{attn}_{ij} \cdot v(x_i) \\ s_j' &= \text{GRU}(s_j, \text{updates}_j) \end{aligned}\]

where \(x_i\) is the input feature and \(s_j\) is the \(j\)-th slot.
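One iteration of this update, simplified for brevity (identity projections for \(k, q, v\) and a plain additive update in place of the GRU; both are assumptions, not the original architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, d = 6, 2, 4                      # number of inputs, slots, feature dim
x = rng.normal(size=(N, d))            # input features; k(x) = v(x) = x here
slots = rng.normal(size=(K, d))        # initial slots; q(s) = s here

M = x @ slots.T / np.sqrt(d)           # (N, K) dot-product logits
attn = np.exp(M) / np.exp(M).sum(axis=1, keepdims=True)  # softmax over SLOTS
updates = attn.T @ x                   # each slot aggregates its inputs
slots = slots + updates                # stand-in for the GRU update

print(attn.sum(axis=1))                # each input distributes 1.0 across slots
```

Because the softmax normalizes over the slot axis (not the input axis, as in standard attention), every input pixel must be "claimed" by some slot, which is what forces the slots to compete and segment the scene.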

Characteristics:

  • Slots compete for input features (softmax normalized across the slot dimension)
  • Spontaneously emergent object segmentation
  • Naturally integrates with subsequent dynamics prediction models

4.3 Object-Centric World Models

Combining Slot Attention with world models:

\[\begin{aligned} \text{Decomposition:} \quad & s_t = \{s_t^1, s_t^2, \ldots, s_t^K\} \\ \text{Interaction:} \quad & s_{t+1}^k = f_\theta(s_t^k, \text{Interact}(s_t^k, s_t^{-k}), a_t) \\ \text{Composition:} \quad & \hat{o}_{t+1} = g(\{s_{t+1}^1, \ldots, s_{t+1}^K\}) \end{aligned}\]

Inter-object interactions can be modeled using Graph Neural Networks (GNN) for relational reasoning.
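A factored transition step can be sketched with shared linear maps standing in for learned networks and a summed pairwise message standing in for the GNN (both are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 4
s = rng.normal(size=(K, d))           # K object slots s_t^1 ... s_t^K
a = rng.normal(size=d)                # action, broadcast to all slots

W_self = 0.9 * np.eye(d)              # stand-ins for learned parameters
W_msg = 0.1 * np.eye(d)
W_act = 0.05 * np.eye(d)

def step(s, a):
    s_next = np.empty_like(s)
    for k in range(K):
        # Interact(s^k, s^{-k}): aggregate messages from all other slots
        msg = sum(s[j] for j in range(K) if j != k)
        s_next[k] = W_self @ s[k] + W_msg @ msg + W_act @ a
    return s_next

s1 = step(s, a)
```

Sharing the same parameters across slots makes the dynamics permutation-equivariant: relabeling the objects relabels the predictions, which is the key structural prior behind compositional generalization.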


5. Spatial Representations

5.1 Neural Radiance Fields (NeRF)

NeRF, proposed by Mildenhall et al. (2020), represents 3D scenes through implicit neural networks:

Basic Formula:

\[F_\Theta: (\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma)\]

where \(\mathbf{x} = (x, y, z)\) is the spatial coordinate, \(\mathbf{d} = (\theta, \phi)\) is the viewing direction, \(\mathbf{c} = (r, g, b)\) is the color, and \(\sigma\) is the volume density.

Volume Rendering Equation:

\[C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \sigma(\mathbf{r}(t)) \mathbf{c}(\mathbf{r}(t), \mathbf{d}) \, dt\]

where transmittance \(T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s)) \, ds\right)\).
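In practice the integral is evaluated by quadrature over samples along the ray, \(C \approx \sum_i T_i\,(1 - e^{-\sigma_i \delta_i})\,c_i\) with \(T_i = \exp(-\sum_{j<i} \sigma_j \delta_j)\). A sketch with toy density and color samples:

```python
import numpy as np

# Four samples along one ray: densities, RGB colors, and spacings delta_i.
sigma = np.array([0.0, 0.5, 2.0, 0.1])
c = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 1]], dtype=float)
delta = np.full(4, 0.25)

alpha = 1.0 - np.exp(-sigma * delta)                       # per-sample opacity
T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])  # transmittance T_i
weights = T * alpha
C = weights @ c                                            # rendered pixel color
print(C, weights.sum())
```

The weights sum to at most 1; any remainder is light that passed through the scene unabsorbed (often composited onto a background color).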

Applications in Robotics:

  • Scene Understanding: Reconstructing complete 3D scenes from a few viewpoints
  • View Planning: Simulating new viewpoints in the NeRF to plan observation paths
  • Grasp Planning: Extracting geometric information from NeRF for grasp point generation
  • Dynamic Scenes: D-NeRF and other variants for handling dynamic environments

5.2 3D Gaussian Splatting

3D Gaussian Splatting, proposed by Kerbl et al. (2023), represents scenes using explicit 3D Gaussian ellipsoids:

Parameters per Gaussian:

\[G_i = \{\mu_i, \Sigma_i, \alpha_i, c_i\}\]
  • \(\mu_i \in \mathbb{R}^3\): Center position
  • \(\Sigma_i \in \mathbb{R}^{3 \times 3}\): Covariance matrix (shape and orientation)
  • \(\alpha_i \in [0, 1]\): Opacity
  • \(c_i\): Spherical harmonic coefficients (view-dependent color)

Rendering: By projecting 3D Gaussians to 2D and performing alpha blending:

\[C(\mathbf{p}) = \sum_{i \in \mathcal{N}} c_i \alpha_i' \prod_{j=1}^{i-1}(1 - \alpha_j')\]

where \(\alpha_i' = \alpha_i \exp(-\frac{1}{2}(\mathbf{p}-\mu_i')^T \Sigma_i'^{-1} (\mathbf{p}-\mu_i'))\).
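The blending sum can be evaluated front-to-back over depth-sorted Gaussians; a single-pixel sketch with one color channel and precomputed effective opacities \(\alpha_i'\) (toy values):

```python
import numpy as np

alpha_eff = np.array([0.6, 0.3, 0.8])   # alpha_i' after the Gaussian falloff
color = np.array([1.0, 0.5, 0.0])       # per-Gaussian color at this pixel

C, transmittance = 0.0, 1.0
for a, c in zip(alpha_eff, color):
    C += transmittance * a * c          # this Gaussian's contribution
    transmittance *= 1.0 - a            # light remaining for those behind it

print(C)  # 0.6*1.0 + 0.4*0.3*0.5 + 0.28*0.8*0.0 = 0.66
```

Because this is plain rasterization plus sorting, it maps directly onto GPU tile-based pipelines, which is where the real-time rendering speed comes from.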

Advantages Over NeRF:

| Dimension | NeRF | 3D-GS |
|---|---|---|
| Rendering speed | Slow (volumetric sampling) | Real-time (rasterization) |
| Training speed | Slow (hours) | Fast (minutes) |
| Editability | Difficult | Direct manipulation of Gaussians |
| Dynamic scenes | Requires additional design | Natural support |
| Memory usage | Small (implicit) | Larger (explicit) |

Applications in Robotics:

  • Real-Time Scene Reconstruction: Supporting online 3D map building for robots
  • Object Manipulation: Tracking and predicting deformable objects
  • Simulation: High-fidelity simulation environments based on GS
  • Sim-to-Real: Narrowing the visual gap between simulation and reality

5.3 Point Cloud and Voxel Representations

Beyond NeRF and 3D-GS, traditional spatial representations remain important:

Point Clouds:

  • Directly from depth sensors
  • PointNet/PointNet++: Directly process unordered point sets
  • Suitable for grasp point detection, collision checking

Voxel Grids:

  • Regular 3D grids
  • Naturally processed by 3D CNNs
  • Lower memory efficiency but structured

TSDF (Truncated Signed Distance Function):

  • Classic incremental 3D reconstruction method
  • Naturally integrates with SLAM systems
  • Provides implicit surface representation
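The classic TSDF fusion update is a weighted running average of truncated signed distances per voxel; a single-voxel sketch (the truncation distance and observations are illustrative):

```python
import numpy as np

trunc = 0.05                                     # truncation distance (meters)

def fuse(tsdf, weight, sdf_obs, w_obs=1.0):
    # Truncate the new measurement, then merge by weighted running average.
    sdf_obs = np.clip(sdf_obs, -trunc, trunc)
    new_w = weight + w_obs
    new_tsdf = (tsdf * weight + sdf_obs * w_obs) / new_w
    return new_tsdf, new_w

tsdf, w = 0.0, 0.0                               # empty voxel
for obs in [0.04, 0.02, 0.03]:                   # three depth-derived SDFs
    tsdf, w = fuse(tsdf, w, obs)
print(tsdf, w)                                   # average distance, total weight
```

The surface is then extracted as the zero level set of the fused field (e.g. with marching cubes), and the per-voxel weight doubles as a confidence measure.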

6. Representation Learning Methods

6.1 Contrastive Learning

Learning to map similar observations to nearby representation spaces:

\[\mathcal{L}_{\text{contrastive}} = -\log \frac{\exp(\text{sim}(z_i, z_j^+) / \tau)}{\sum_k \exp(\text{sim}(z_i, z_k) / \tau)}\]

Applications in Robotics:

  • Same scene from different viewpoints \(\rightarrow\) positive pair
  • Temporally adjacent frames \(\rightarrow\) positive pair
  • Learning view-invariant, occlusion-robust representations
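The loss above (InfoNCE form) can be computed directly; in this sketch the positive is by convention the first candidate, similarity is cosine, and the embeddings are random stand-ins:

```python
import numpy as np

def info_nce(z_anchor, z_candidates, tau=0.1):
    # -log softmax of the positive's similarity among all candidates.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(z_anchor, z) for z in z_candidates]) / tau
    sims = sims - sims.max()                     # numerical stability
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 8))

loss_hard = info_nce(z[0], z[1:])                       # random positive
loss_easy = info_nce(z[0], np.vstack([z[0:1], z[2:]]))  # positive == anchor
print(loss_easy < loss_hard)                     # matching positive lowers loss
```

A well-trained encoder maps positive pairs (other views, adjacent frames) close together, so their loss looks like the "easy" case above.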

6.2 Reconstruction-Based Representation Learning

Learning meaningful representations through reconstruction tasks:

  • Autoencoders (AE/VAE): Reconstructing images
  • MAE (Masked Autoencoder): Reconstructing masked patches
  • Video Prediction: Predicting future frames

6.3 Pretrained Visual Representations

Directly using visual features pretrained on large-scale data:

  • CLIP: Vision-language aligned representations
  • DINOv2: Self-supervised visual representations
  • SPA (Spatial Patch Alignment): Spatial representations tailored for robot tasks

7. Summary and Outlook

Core Observations

  1. World models are the foundation of planning: Without a world model, only reactive control is possible
  2. Representation granularity matters: Object-centric representations are more suitable for manipulation tasks than holistic representations
  3. Spatial representations are advancing rapidly: NeRF \(\rightarrow\) 3D-GS represents a qualitative leap
  4. Pretrain + fine-tune: Large-scale pretrained visual representations have become the default choice

Open Challenges

  • How to build world models that support long-horizon reasoning
  • How to incorporate physics priors into representations
  • How to achieve real-time, high-precision dynamic 3D representations
  • Representation transferability: from simulation to reality, from one task to another

References

  • Friston, K. (2010). "The Free-Energy Principle: A Unified Brain Theory?"
  • Hafner et al. (2019). "Learning Latent Dynamics for Planning from Pixels" (PlaNet)
  • Hafner et al. (2020). "Dream to Control: Learning Behaviors by Latent Imagination" (Dreamer)
  • Locatello et al. (2020). "Object-Centric Learning with Slot Attention"
  • Mildenhall et al. (2020). "NeRF: Representing Scenes as Neural Radiance Fields"
  • Kerbl et al. (2023). "3D Gaussian Splatting for Real-Time Radiance Field Rendering"
