
Representations and World Models

Overview

How does an agent internally represent the external world? How can it use these representations for prediction and planning? This article explores core representation problems in embodied intelligence: from predictive coding and the free energy principle to learnable world models, from object-centric representations to spatial representations (NeRF, 3D Gaussian Splatting).


1. Philosophical Foundations of Internal Representation

1.1 Representationalism vs. Anti-Representationalism

This is a fundamental debate in embodied cognition:

| Position | Claim | Representatives |
|---|---|---|
| Representationalism | Intelligence requires internal world models | Marr, Craik |
| Anti-Representationalism | Intelligence can function without representations (reactive) | Brooks, Beer |
| Minimal Representationalism | Representations are needed but should be as parsimonious as possible | Clark |

Modern Consensus: Complex tasks (such as long-horizon manipulation, multi-step planning) require some form of internal representation, but these can be implicit and distributed rather than explicit symbolic representations.

1.2 Craik's Internal Model Hypothesis

Kenneth Craik (1943) proposed:

Organisms construct "small-scale models" of the external world in their brains, which they use to predict events, reason, and plan.

This hypothesis is the philosophical origin of modern world model research.


2. Predictive Coding and the Free Energy Principle

2.1 Predictive Coding

Predictive coding theory posits that the brain's core function is prediction -- continuously predicting the next moment's sensory input and minimizing prediction error.

Hierarchical Predictive Coding:

In a hierarchical structure, each layer \(l\) generates predictions of its lower-layer input and computes prediction errors:

\[\epsilon_l = o_l - g_l(\hat{s}_{l+1})\]

where \(\hat{s}_{l+1}\) is the upper layer's state estimate and \(g_l\) is the generative model. Prediction error \(\epsilon_l\) propagates upward, driving the upper layer to update its state estimate.
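A minimal numerical sketch of this loop, assuming a linear generative model \(g_l(s) = Ws\) and gradient descent of the state estimate on the squared prediction error (both are illustrative stand-ins, not part of the theory itself):

```python
import numpy as np

# Predictive-coding inference for one layer pair: the upper-layer estimate
# s_hat is updated until the prediction W @ s_hat matches the input o.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5],
              [0.2, -0.3]])            # generative weights (stand-in for g_l)
o = np.array([1.0, -0.5, 0.3, 0.8])    # lower-layer input o_l
s_hat = np.zeros(2)                    # upper-layer state estimate

lr = 0.1
for _ in range(200):
    eps = o - W @ s_hat                # prediction error epsilon_l
    s_hat += lr * (W.T @ eps)          # error propagates up, driving the update

print(s_hat)  # settles at the estimate minimizing the prediction error
```

At convergence the remaining error is orthogonal to what the generative model can explain, i.e. the estimate is the least-squares solution.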

2.2 Free Energy Principle

Karl Friston's free energy principle unifies perception, action, and learning:

\[F = D_{KL}[q(s|o) \| p(s)] - \mathbb{E}_{q(s|o)}[\ln p(o|s)]\]

where:

  • \(F\): Variational free energy (the quantity to be minimized)
  • \(q(s|o)\): Approximate posterior belief (the brain's estimate of hidden states)
  • \(p(s)\): Prior belief
  • \(\mathbb{E}_{q(s|o)}[\ln p(o|s)]\): Expected log-likelihood of observations (accuracy)

Decomposition of Free Energy:

\[F = \underbrace{D_{KL}[q(s|o) \| p(s|o)]}_{\text{Posterior approximation error} \geq 0} - \underbrace{\ln p(o)}_{\text{Log model evidence}}\]

Therefore \(F \geq -\ln p(o)\), and minimizing free energy is equivalent to:

  1. Perception (updating \(q\)): Making posterior beliefs more accurate
  2. Action (changing \(o\)): Making observations conform to expectations
  3. Learning (updating the model): Making the generative model more accurate
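The bound \(F \geq -\ln p(o)\) can be checked numerically in a toy two-state world (the priors and likelihoods below are arbitrary illustrative numbers):

```python
import numpy as np

# Two hidden states; fix an observation o and check F >= -ln p(o),
# with equality exactly when q equals the true posterior p(s|o).
p_s = np.array([0.7, 0.3])           # prior p(s)
p_o_given_s = np.array([0.9, 0.2])   # likelihood p(o|s) for the observed o
p_o = np.sum(p_s * p_o_given_s)      # model evidence p(o)

def free_energy(q):
    # F = E_q[ln q(s) - ln p(s)] - E_q[ln p(o|s)]
    return np.sum(q * (np.log(q) - np.log(p_s))) - np.sum(q * np.log(p_o_given_s))

q_bad = np.array([0.5, 0.5])         # an arbitrary belief
q_opt = p_s * p_o_given_s / p_o      # the exact posterior p(s|o)

print(free_energy(q_bad), free_energy(q_opt), -np.log(p_o))
```

Minimizing \(F\) over \(q\) therefore recovers the posterior, which is the "perception" route to free energy minimization.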

2.3 Active Inference

Under the free energy framework, the purpose of action is to minimize expected free energy:

\[a^* = \arg\min_a \mathbb{E}_{q(s'|a)}\left[ F(o', s') \right]\]

The agent selects actions that make future observations conform to its preferences (priors). This unifies perception and action -- both are different aspects of free energy minimization.

Significance for Robotics:

  • Provides a unified perception-action theoretical framework
  • Naturally handles uncertainty and active exploration
  • Explains curiosity-driven exploration behavior

3. World Models in Robotics

3.1 Learned Dynamics Models

The core of a world model is learning the environment's state transition distribution:

\[p_\theta(s_{t+1} \mid s_t, a_t)\]

Deterministic Model:

\[\hat{s}_{t+1} = f_\theta(s_t, a_t)\]

Stochastic Model (more suitable for the real world):

\[s_{t+1} \sim \mathcal{N}(\mu_\theta(s_t, a_t), \Sigma_\theta(s_t, a_t))\]
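A sketch of sampling from such a stochastic model, where `mu_theta` and `sigma_theta` stand in for learned network heads (here a fixed linear map and a constant diagonal covariance, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def mu_theta(s, a):
    return 0.9 * s + a               # stand-in for a learned mean network

def sigma_theta(s, a):
    return np.diag([0.05, 0.05])     # stand-in for a learned covariance head

s = np.array([1.0, -1.0])
a = np.array([0.1, 0.0])

# A stochastic rollout step draws samples rather than a point prediction,
# so repeated rollouts expose the model's uncertainty.
samples = rng.multivariate_normal(mu_theta(s, a), sigma_theta(s, a), size=1000)
print(samples.mean(axis=0))          # close to mu_theta(s, a) = [1.0, -0.9]
```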

3.2 RSSM (Recurrent State Space Model)

The RSSM proposed by Hafner et al. (2019) is one of the most successful world model architectures (used in the Dreamer series):

State Space consists of a deterministic component \(h_t\) and a stochastic component \(z_t\):

\[\begin{aligned} \text{Deterministic path:} \quad & h_t = f_\theta(h_{t-1}, z_{t-1}, a_{t-1}) \\ \text{Prior:} \quad & \hat{z}_t \sim p_\theta(z_t | h_t) \\ \text{Posterior:} \quad & z_t \sim q_\phi(z_t | h_t, o_t) \\ \text{Observation decoder:} \quad & \hat{o}_t \sim p_\theta(o_t | h_t, z_t) \\ \text{Reward prediction:} \quad & \hat{r}_t \sim p_\theta(r_t | h_t, z_t) \end{aligned}\]
```mermaid
flowchart LR
    subgraph RSSM
        A["h_{t-1}, z_{t-1}"] -->|GRU| B["h_t"]
        C["a_{t-1}"] -->|GRU| B
        B -->|Prior Network| D["z_t ~ prior"]
        B -->|Posterior Network| E["z_t ~ posterior"]
        F["o_t"] -->|Encoder| E
        B --> G["Observation Decoder"]
        D --> G
        G --> H["ô_t"]
    end
```

Training Objective:

\[\mathcal{L} = \sum_t \left[ -\ln p_\theta(o_t | h_t, z_t) + \beta \cdot D_{KL}[q_\phi(z_t|h_t,o_t) \| p_\theta(z_t|h_t)] \right]\]
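The KL term of this objective has a closed form when prior and posterior are diagonal Gaussians, as in RSSM; a self-contained sketch (the example means and variances are arbitrary):

```python
import numpy as np

# KL divergence between diagonal Gaussians q = N(mu_q, var_q) and
# p = N(mu_p, var_p), summed over latent dimensions.
def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

mu_q, var_q = np.array([0.5, -0.2]), np.array([0.3, 0.3])  # posterior stats
mu_p, var_p = np.zeros(2), np.ones(2)                      # prior stats
beta = 1.0

kl = beta * kl_diag_gauss(mu_q, var_q, mu_p, var_p)
print(kl)  # non-negative; zero iff posterior equals prior
```

In training, this term pulls the prior toward the posterior so that imagination-time rollouts (which only have the prior) stay consistent with observation-conditioned inference.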

3.3 World Models for Planning

With a world model, planning can be performed in "imagination":

Model Predictive Control (MPC):

\[a_{t:t+H}^* = \arg\min_{a_{t:t+H}} \sum_{k=0}^{H} c(\hat{s}_{t+k}, a_{t+k})\]
\[\text{s.t.} \quad \hat{s}_{t+k+1} = f_\theta(\hat{s}_{t+k}, a_{t+k})\]
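A random-shooting solver is the simplest way to approximate this optimization: sample candidate action sequences, roll each out through the model, keep the best. The dynamics and cost below are toy stand-ins for learned models:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(s, a):                   # toy point-mass dynamics (stand-in)
    return s + a

def cost(s, a):                      # reach the origin with small actions
    return np.sum(s ** 2) + 0.1 * np.sum(a ** 2)

def mpc(s0, horizon=5, n_samples=256):
    best_a0, best_cost = None, np.inf
    for _ in range(n_samples):
        seq = rng.uniform(-1, 1, size=(horizon, 2))   # candidate a_{t:t+H}
        s, total = s0.copy(), 0.0
        for a in seq:
            total += cost(s, a)
            s = f_theta(s, a)        # imagined rollout, no real interaction
        if total < best_cost:
            best_a0, best_cost = seq[0], total
    return best_a0, best_cost        # execute only the first action, replan

s0 = np.array([1.0, -1.0])
a0, c_best = mpc(s0)
```

Executing only the first action and replanning at every step is what makes this *receding-horizon* control; CEM and other samplers refine the same idea.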

Dreamer's Imagination-Based Planning:

Dreamer rolls out imagined trajectories in the learned latent space and trains an actor-critic policy on them, greatly reducing the amount of real environment interaction required.


4. Object-Centric Representations

4.1 Why Object-Centric Representations Are Needed

Traditional holistic representations (such as CNN features) encode the entire scene into a single vector, but:

  • Struggle with compositional generalization (new object combinations)
  • Difficulty reasoning about inter-object relationships
  • Difficulty tracking individual object dynamics

4.2 Slot Attention

Slot Attention, proposed by Locatello et al. (2020), is a representative method for object-centric representations:

Core Idea: Decompose the scene into \(K\) "slots," each representing an object or object part.

Iterative Attention Process:

\[\begin{aligned} M_{ij} &= \frac{k(x_i) \cdot q(s_j)}{\sqrt{d}} \\ \text{attn}_{ij} &= \frac{e^{M_{ij}}}{\sum_l e^{M_{il}}} \quad \text{(slot competition)} \\ \text{updates}_j &= \sum_i \text{attn}_{ij} \cdot v(x_i) \\ s_j' &= \text{GRU}(s_j, \text{updates}_j) \end{aligned}\]

where \(x_i\) is the input feature and \(s_j\) is the \(j\)-th slot.
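One iteration of this update, simplified for brevity (identity projections for \(k, q, v\) and a plain additive update in place of the GRU; both are assumptions, not the original architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, d = 6, 2, 4                      # number of inputs, slots, feature dim
x = rng.normal(size=(N, d))            # input features; k(x) = v(x) = x here
slots = rng.normal(size=(K, d))        # initial slots; q(s) = s here

M = x @ slots.T / np.sqrt(d)           # (N, K) dot-product logits
attn = np.exp(M) / np.exp(M).sum(axis=1, keepdims=True)  # softmax over SLOTS
updates = attn.T @ x                   # each slot aggregates its inputs
slots = slots + updates                # stand-in for the GRU update

print(attn.sum(axis=1))                # each input distributes 1.0 across slots
```

Because the softmax normalizes over the slot axis (not the input axis, as in standard attention), every input pixel must be "claimed" by some slot, which is what forces the slots to compete and segment the scene.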

Characteristics:

  • Slots compete for input features (softmax normalized across the slot dimension)
  • Spontaneously emergent object segmentation
  • Naturally integrates with subsequent dynamics prediction models

4.3 Object-Centric World Models

Combining Slot Attention with world models:

\[\begin{aligned} \text{Decomposition:} \quad & s_t = \{s_t^1, s_t^2, \ldots, s_t^K\} \\ \text{Interaction:} \quad & s_{t+1}^k = f_\theta(s_t^k, \text{Interact}(s_t^k, s_t^{-k}), a_t) \\ \text{Composition:} \quad & \hat{o}_{t+1} = g(\{s_{t+1}^1, \ldots, s_{t+1}^K\}) \end{aligned}\]

Inter-object interactions can be modeled using Graph Neural Networks (GNN) for relational reasoning.
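A factored transition step can be sketched with shared linear maps standing in for learned networks and a summed pairwise message standing in for the GNN (both are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 4
s = rng.normal(size=(K, d))           # K object slots s_t^1 ... s_t^K
a = rng.normal(size=d)                # action, broadcast to all slots

W_self = 0.9 * np.eye(d)              # stand-ins for learned parameters
W_msg = 0.1 * np.eye(d)
W_act = 0.05 * np.eye(d)

def step(s, a):
    s_next = np.empty_like(s)
    for k in range(K):
        # Interact(s^k, s^{-k}): aggregate messages from all other slots
        msg = sum(s[j] for j in range(K) if j != k)
        s_next[k] = W_self @ s[k] + W_msg @ msg + W_act @ a
    return s_next

s1 = step(s, a)
```

Sharing the same parameters across slots makes the dynamics permutation-equivariant: relabeling the objects relabels the predictions, which is the key structural prior behind compositional generalization.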


5. Spatial Representations

5.1 Neural Radiance Fields (NeRF)

NeRF, proposed by Mildenhall et al. (2020), represents 3D scenes through implicit neural networks:

Basic Formula:

\[F_\Theta: (\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma)\]

where \(\mathbf{x} = (x, y, z)\) is the spatial coordinate, \(\mathbf{d} = (\theta, \phi)\) is the viewing direction, \(\mathbf{c} = (r, g, b)\) is the color, and \(\sigma\) is the volume density.

Volume Rendering Equation:

\[C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \sigma(\mathbf{r}(t)) \mathbf{c}(\mathbf{r}(t), \mathbf{d}) \, dt\]

where transmittance \(T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s)) \, ds\right)\).
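In practice the integral is evaluated by quadrature over samples along the ray, \(C \approx \sum_i T_i\,(1 - e^{-\sigma_i \delta_i})\,c_i\) with \(T_i = \exp(-\sum_{j<i} \sigma_j \delta_j)\). A sketch with toy density and color samples:

```python
import numpy as np

# Four samples along one ray: densities, RGB colors, and spacings delta_i.
sigma = np.array([0.0, 0.5, 2.0, 0.1])
c = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 1]], dtype=float)
delta = np.full(4, 0.25)

alpha = 1.0 - np.exp(-sigma * delta)                       # per-sample opacity
T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])  # transmittance T_i
weights = T * alpha
C = weights @ c                                            # rendered pixel color
print(C, weights.sum())
```

The weights sum to at most 1; any remainder is light that passed through the scene unabsorbed (often composited onto a background color).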

Applications in Robotics:

  • Scene Understanding: Reconstructing complete 3D scenes from a few viewpoints
  • View Planning: Simulating new viewpoints in the NeRF to plan observation paths
  • Grasp Planning: Extracting geometric information from NeRF for grasp point generation
  • Dynamic Scenes: D-NeRF and other variants for handling dynamic environments

5.2 3D Gaussian Splatting

3D Gaussian Splatting, proposed by Kerbl et al. (2023), represents scenes using explicit 3D Gaussian ellipsoids:

Parameters per Gaussian:

\[G_i = \{\mu_i, \Sigma_i, \alpha_i, c_i\}\]
  • \(\mu_i \in \mathbb{R}^3\): Center position
  • \(\Sigma_i \in \mathbb{R}^{3 \times 3}\): Covariance matrix (shape and orientation)
  • \(\alpha_i \in [0, 1]\): Opacity
  • \(c_i\): Spherical harmonic coefficients (view-dependent color)

Rendering: By projecting 3D Gaussians to 2D and performing alpha blending:

\[C(\mathbf{p}) = \sum_{i \in \mathcal{N}} c_i \alpha_i' \prod_{j=1}^{i-1}(1 - \alpha_j')\]

where \(\alpha_i' = \alpha_i \exp(-\frac{1}{2}(\mathbf{p}-\mu_i')^T \Sigma_i'^{-1} (\mathbf{p}-\mu_i'))\).
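The blending sum can be evaluated front-to-back over depth-sorted Gaussians; a single-pixel sketch with one color channel and precomputed effective opacities \(\alpha_i'\) (toy values):

```python
import numpy as np

alpha_eff = np.array([0.6, 0.3, 0.8])   # alpha_i' after the Gaussian falloff
color = np.array([1.0, 0.5, 0.0])       # per-Gaussian color at this pixel

C, transmittance = 0.0, 1.0
for a, c in zip(alpha_eff, color):
    C += transmittance * a * c          # this Gaussian's contribution
    transmittance *= 1.0 - a            # light remaining for those behind it

print(C)  # 0.6*1.0 + 0.4*0.3*0.5 + 0.28*0.8*0.0 = 0.66
```

Because this is plain rasterization plus sorting, it maps directly onto GPU tile-based pipelines, which is where the real-time rendering speed comes from.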

Advantages Over NeRF:

| Dimension | NeRF | 3D-GS |
|---|---|---|
| Rendering speed | Slow (volumetric sampling) | Real-time (rasterization) |
| Training speed | Slow (hours) | Fast (minutes) |
| Editability | Difficult | Direct manipulation of Gaussians |
| Dynamic scenes | Requires additional design | Natural support |
| Memory usage | Small (implicit) | Larger (explicit) |

Applications in Robotics:

  • Real-Time Scene Reconstruction: Supporting online 3D map building for robots
  • Object Manipulation: Tracking and predicting deformable objects
  • Simulation: High-fidelity simulation environments based on GS
  • Sim-to-Real: Narrowing the visual gap between simulation and reality

5.3 Point Cloud and Voxel Representations

Beyond NeRF and 3D-GS, traditional spatial representations remain important:

Point Clouds:

  • Directly from depth sensors
  • PointNet/PointNet++: Directly process unordered point sets
  • Suitable for grasp point detection, collision checking

Voxel Grids:

  • Regular 3D grids
  • Naturally processed by 3D CNNs
  • Lower memory efficiency but structured

TSDF (Truncated Signed Distance Function):

  • Classic incremental 3D reconstruction method
  • Naturally integrates with SLAM systems
  • Provides implicit surface representation
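The classic TSDF fusion update is a weighted running average of truncated signed distances per voxel; a single-voxel sketch (the truncation distance and observations are illustrative):

```python
import numpy as np

trunc = 0.05                                     # truncation distance (meters)

def fuse(tsdf, weight, sdf_obs, w_obs=1.0):
    # Truncate the new measurement, then merge by weighted running average.
    sdf_obs = np.clip(sdf_obs, -trunc, trunc)
    new_w = weight + w_obs
    new_tsdf = (tsdf * weight + sdf_obs * w_obs) / new_w
    return new_tsdf, new_w

tsdf, w = 0.0, 0.0                               # empty voxel
for obs in [0.04, 0.02, 0.03]:                   # three depth-derived SDFs
    tsdf, w = fuse(tsdf, w, obs)
print(tsdf, w)                                   # average distance, total weight
```

The surface is then extracted as the zero level set of the fused field (e.g. with marching cubes), and the per-voxel weight doubles as a confidence measure.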

6. Representation Learning Methods

6.1 Contrastive Learning

Learning to map similar observations to nearby representation spaces:

\[\mathcal{L}_{\text{contrastive}} = -\log \frac{\exp(\text{sim}(z_i, z_j^+) / \tau)}{\sum_k \exp(\text{sim}(z_i, z_k) / \tau)}\]

Applications in Robotics:

  • Same scene from different viewpoints \(\rightarrow\) positive pair
  • Temporally adjacent frames \(\rightarrow\) positive pair
  • Learning view-invariant, occlusion-robust representations
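The loss above (InfoNCE form) can be computed directly; in this sketch the positive is by convention the first candidate, similarity is cosine, and the embeddings are random stand-ins:

```python
import numpy as np

def info_nce(z_anchor, z_candidates, tau=0.1):
    # -log softmax of the positive's similarity among all candidates.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(z_anchor, z) for z in z_candidates]) / tau
    sims = sims - sims.max()                     # numerical stability
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 8))

loss_hard = info_nce(z[0], z[1:])                       # random positive
loss_easy = info_nce(z[0], np.vstack([z[0:1], z[2:]]))  # positive == anchor
print(loss_easy < loss_hard)                     # matching positive lowers loss
```

A well-trained encoder maps positive pairs (other views, adjacent frames) close together, so their loss looks like the "easy" case above.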

6.2 Reconstruction-Based Representation Learning

Learning meaningful representations through reconstruction tasks:

  • Autoencoders (AE/VAE): Reconstructing images
  • MAE (Masked Autoencoder): Reconstructing masked patches
  • Video Prediction: Predicting future frames

6.3 Pretrained Visual Representations

Directly using visual features pretrained on large-scale data:

  • CLIP: Vision-language aligned representations
  • DINOv2: Self-supervised visual representations
  • SPA (Spatial Patch Alignment): Spatial representations tailored for robot tasks

7. Summary and Outlook

Core Observations

  1. World models are the foundation of planning: Without a world model, only reactive control is possible
  2. Representation granularity matters: Object-centric representations are more suitable for manipulation tasks than holistic representations
  3. Spatial representations are advancing rapidly: NeRF \(\rightarrow\) 3D-GS represents a qualitative leap
  4. Pretrain + fine-tune: Large-scale pretrained visual representations have become the default choice

Open Challenges

  • How to build world models that support long-horizon reasoning
  • How to incorporate physics priors into representations
  • How to achieve real-time, high-precision dynamic 3D representations
  • Representation transferability: from simulation to reality, from one task to another

References

  • Friston, K. (2010). "The Free-Energy Principle: A Unified Brain Theory?"
  • Hafner et al. (2019). "Learning Latent Dynamics for Planning from Pixels" (PlaNet)
  • Hafner et al. (2020). "Dream to Control: Learning Behaviors by Latent Imagination" (Dreamer)
  • Locatello et al. (2020). "Object-Centric Learning with Slot Attention"
  • Mildenhall et al. (2020). "NeRF: Representing Scenes as Neural Radiance Fields"
  • Kerbl et al. (2023). "3D Gaussian Splatting for Real-Time Radiance Field Rendering"
