
Model Roadmap

Embodied AI models did not evolve along a single line. They advanced along two axes at the same time:

  • a timeline axis: from classic imitation learning to sequence modeling, then VLAs, world models, and agentic planning
  • a paradigm axis: from "learn actions" to "learn action distributions," then to "unify vision, language, and action," and finally to "learn a world and plan inside it"

This note does not replace single-model notes. Its job is to provide the missing map for the whole 05_Models section. After reading it, you should know:

  1. what stages robot models have gone through
  2. where ACT sits in the overall lineage
  3. which branches VLA, Diffusion Policy, world models, and LLM planning belong to
  4. what to read next

Related notes: Foundation Models for Robotics | VLA Models | ACT Model | LLM-Driven Robotics | World Models & Video Generation


1. Why a Model Roadmap Is Needed

If you only memorize models by publication year, two mistakes are common:

  • assuming newer models fully replace older ones
  • assuming all works compete on the same technical line

In practice:

  • ACT is not a foundation model, but it is a key bridge for the action chunking line
  • Diffusion Policy is not a VLA, but it defines a major branch of generative action modeling
  • RT-2 / OpenVLA / pi0 belong to the VLA mainline, but use different action heads
  • SayCan / Code as Policies mainly solve planning and interfaces, not low-level action generation
  • Dreamer / UniSim / Cosmos are closer to world modeling and training in imagination

So the right organization is neither pure chronology nor pure taxonomy: you need the timeline and the paradigm tree together.


2. Two Axes: Timeline and Paradigm Tree

2.1 What the timeline tells you

The timeline answers: when did the key turning points happen, and which later ideas inherit from them?

2.2 What the paradigm tree tells you

The paradigm tree answers: which layer of the robotics stack a model is mainly trying to solve.

| Dimension | Core Question | Representative Models |
| --- | --- | --- |
| Classic policy learning | How do we learn actions from demonstrations? | BC, DAgger, GAIL |
| Sequence / generative policies | How do we model multi-step, multimodal actions? | Decision Transformer, BeT, ACT, Diffusion Policy |
| VLA / robot foundation models | How do we unify vision, language, and action? | RT-1, RT-2, Octo, OpenVLA, pi0 |
| World models | How do we predict futures and train inside a model? | Dreamer, UniSim, Genie, Cosmos |
| LLM planning | How do we do long-horizon reasoning, task decomposition, and tool use? | SayCan, Code as Policies, VoxPoser |

3. Stage One: Classic Robot Learning Models

The core goal here was simple: teach the robot to imitate actions from data.

Typical representatives:

  • BC: cast policy learning as supervised learning
  • DAgger: fix distribution shift at deployment time
  • IRL / MaxEnt IRL: infer rewards from demonstrations
  • GAIL: formulate imitation learning as adversarial training
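To make the BC bullet concrete, here is a minimal sketch of imitation as supervised regression, using a 1-D linear policy in plain Python; all names and the demonstration data are illustrative.

```python
# Behavior cloning reduced to supervised learning: fit a policy to
# (observation, action) pairs from demonstrations. A 1-D linear policy
# a = w * o + b keeps the sketch dependency-free.

def bc_fit(demos, lr=0.1, epochs=2000):
    """Fit a = w * o + b by SGD on squared action-prediction error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for o, a in demos:
            err = (w * o + b) - a
            w -= lr * err * o
            b -= lr * err
    return w, b

# Noiseless demonstrations from an "expert" acting as a = 2 * o + 1.
demos = [(o / 10, 2 * o / 10 + 1) for o in range(10)]
w, b = bc_fit(demos)
```

Stage-one methods like DAgger then address what this loop ignores: at deployment, the policy's own mistakes take it to observations the demonstrations never covered.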

The hallmarks of this stage:

  • usually narrow tasks
  • relatively small datasets
  • mostly single-step action mapping
  • weak language and weak cross-task generalization

This material mainly lives in Imitation Learning.


4. Stage Two: Sequence Modeling and Generative Policies

Once researchers realized that single-step action regression produces jittery control, struggles with multimodal demonstrations, and accumulates error over long horizons, a second stage emerged.

4.1 The core shift

From:

\[ \pi(o_t) \rightarrow a_t \]

to:

\[ \pi(o_{t-k:t}, \text{task}) \rightarrow a_{t:t+H} \]

In other words, models stopped predicting only the current action and started predicting an action sequence, or a full action distribution.
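A toy control loop shows what the new interface buys: with chunked outputs, the model is queried only once every H steps. The zero-emitting policy and all names below are illustrative.

```python
# Stage-two interface: the policy returns an action chunk a_{t:t+H} instead
# of a single a_t, so the control loop replans only every `horizon` steps.

def chunked_policy(obs_history, task, horizon):
    """Stand-in for pi(o_{t-k:t}, task) -> a_{t:t+H}; emits dummy actions."""
    return [0.0] * horizon

def run_episode(policy, obs_history, task, horizon, steps):
    calls, executed = 0, []
    while len(executed) < steps:
        chunk = policy(obs_history, task, horizon)
        calls += 1
        executed.extend(chunk[: steps - len(executed)])  # execute open loop
    return calls, executed

calls, executed = run_episode(chunked_policy, [[0.0]], "pick", horizon=8, steps=20)
```

A 20-step episode with horizon 8 needs only 3 model calls, which is one reason chunking matters for slow, large models.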

4.2 Representative models

  • Decision Transformer: control as sequence modeling
  • BeT: behavior modeling with discrete latent action tokens
  • ACT: CVAE + Transformer for action chunk prediction
  • Diffusion Policy: diffusion models for multimodal action distributions
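The Diffusion Policy bullet hinges on one loop: initialize an action chunk as noise and iteratively denoise it. The sketch below uses an oracle that already knows the clean chunk, so it illustrates only the refinement loop, not a learned denoiser.

```python
import random

# Diffusion-style action generation, skeleton only: start from Gaussian noise
# and repeatedly remove a fraction of the remaining gap to the clean chunk.
# A real Diffusion Policy predicts that gap (the noise) with a trained network.

def denoise(noisy, target, steps=10):
    a = list(noisy)
    for k in range(steps):
        frac = 1.0 / (steps - k)      # final step removes the whole gap
        a = [x - frac * (x - t) for x, t in zip(a, target)]
    return a

random.seed(0)
target = [0.5, -0.2, 0.1]             # the "clean" action chunk
noisy = [random.gauss(0.0, 1.0) for _ in target]
clean = denoise(noisy, target)
```

The payoff of the real method is that sampling from different noise draws yields different valid action modes, instead of averaging them into one bad action.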

4.3 Why ACT is a bridge node

ACT matters not because it is the biggest model, but because it made the following line explicit:

```mermaid
graph LR
    IL[Classic imitation learning] --> CHUNK[Action chunking]
    CHUNK --> ACT[ACT]
    ACT --> GEN[Generative action modeling]
    GEN --> DP[Diffusion Policy / RDT]
    ACT --> VLA_HEAD[Later VLA chunk / token / horizon design]
    VLA_HEAD --> OPENVLA[OpenVLA / pi0 / FAST]

    style ACT fill:#e8f5e9
    style CHUNK fill:#fff3e0
    style VLA_HEAD fill:#e3f2fd
```

So ACT links classic imitation learning and action chunking upstream to generative action modeling and later VLA action-head design downstream.

If you read only VLA papers without understanding ACT, you miss a large part of the motivation behind chunked actions.
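One concrete piece of that motivation is ACT's temporal ensembling: successive overlapping chunks each contain a prediction for the current action, and those predictions are blended with exponential weights. The weighting constant and direction below are illustrative; see the ACT paper for its exact scheme.

```python
import math

# Blend the predictions that all recent chunks make for the current step t.
# chunk_buffer maps chunk start time -> list of H predicted actions.

def temporal_ensemble(chunk_buffer, t, m=0.1):
    preds, weights = [], []
    for start, chunk in chunk_buffer.items():
        age = t - start               # how old this chunk's prediction is
        if 0 <= age < len(chunk):
            preds.append(chunk[age])
            weights.append(math.exp(-m * age))
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, preds)) / total

# Two overlapping chunks both predict an action for step t = 2.
buffer = {0: [0.0, 0.1, 0.2, 0.3], 2: [0.4, 0.5, 0.6, 0.7]}
a_t = temporal_ensemble(buffer, t=2)
```

The blended action lies between the two chunks' proposals, which is exactly the smoothing effect that reduces jitter at chunk boundaries.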


5. Stage Three: VLAs and Robot Foundation Models

The central question became:

Can we unify vision, language, and action in one model, and borrow web-scale pretraining for robot generalization?

5.1 Mainline representatives

  • RT-1: large-scale real-robot Robotics Transformer
  • RT-2: VLM-to-VLA transfer from web knowledge
  • Octo: open-source, multi-embodiment, unified action interface
  • OpenVLA: open 7B VLA for community reproduction
  • HPT: unifying heterogeneous sensor inputs
  • RDT: diffusion Transformer for high-dimensional bimanual action generation
  • pi0: a mainline VLA with a flow-matching action head

5.2 What changed in this stage

| Change | Earlier policy models | VLA / foundation-model stage |
| --- | --- | --- |
| Inputs | state, limited images | multi-view vision + language + proprioception |
| Data | single task or single robot | multi-task, multi-robot, multi-source |
| Objective | learn a specific skill | learn a general vision-language-action mapping |
| Output | single-step or short-horizon action | token / chunk / diffusion / flow |
| Source of generalization | demo coverage | pretraining + cross-embodiment data + fine-tuning |

For details, see Foundation Models for Robotics and VLA Models.


6. Stage Four: World Models and Training in Imagination

As action models became stronger, the research focus moved one level up:

If a model can predict not only actions but also what will happen next, can we trial-and-error inside the model first?

This branch includes:

  • Dreamer family: train policies inside a latent world model
  • UniSim: replace large amounts of real interaction with generated worlds
  • Genie / interactive video world models: learn controllable dynamics from video
  • Cosmos / Genesis: combine large-scale video generation, simulation, and physical AI data

The emphasis is no longer only "give me an action," but also:

  • predict future visual states
  • predict latent dynamics
  • evaluate or train policies with imagined rollouts
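The three bullets above collapse into one loop: roll the policy out inside the learned model and score it on imagined rewards. The linear "world model" and reward below are toy stand-ins for learned latent dynamics.

```python
# "Training in imagination", skeleton only: evaluate a policy with rollouts
# inside a learned dynamics model instead of the real environment.

def imagined_return(world_step, reward_fn, policy, s0, horizon):
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)
        s = world_step(s, a)          # model-predicted next (latent) state
        total += reward_fn(s)
    return total

world_step = lambda s, a: s + a       # toy latent dynamics
reward_fn = lambda s: -abs(s)         # reward favors staying near zero
damp = lambda s: -0.5 * s             # a policy that moves toward zero
idle = lambda s: 0.0                  # a policy that does nothing

good = imagined_return(world_step, reward_fn, damp, s0=2.0, horizon=5)
bad = imagined_return(world_step, reward_fn, idle, s0=2.0, horizon=5)
```

Dreamer-style methods additionally backpropagate through such imagined rollouts to improve the policy, not just to rank candidates.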

The matching note is World Models & Video Generation.


7. Stage Five: LLM Planning and Embodied Agents

There is another parallel line that does not primarily optimize low-level actions. Instead, it studies:

How can large models do long-horizon task understanding, tool use, decomposition, and interface orchestration?

Representative works:

  • SayCan: combine LLM likelihood with affordance scores
  • Code as Policies: generate code that calls robot APIs
  • VoxPoser: use 3D value maps and spatial representations for manipulation
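The SayCan bullet comes down to one product: rank candidate skills by LLM likelihood times affordance. The score tables below are made-up numbers for illustration only.

```python
# SayCan-style skill selection: "say" (does the LLM think this skill helps
# the instruction?) times "can" (does the robot think it would succeed now?).

def saycan_pick(llm_scores, affordances):
    combined = {s: llm_scores[s] * affordances[s] for s in llm_scores}
    return max(combined, key=combined.get)

# The LLM prefers the sponge, but no sponge is currently visible.
llm_scores = {"pick up sponge": 0.6, "pick up coke": 0.3, "go to table": 0.1}
affordances = {"pick up sponge": 0.1,   # low: sponge not detected
               "pick up coke": 0.9,     # high: coke is reachable
               "go to table": 0.8}
best = saycan_pick(llm_scores, affordances)
```

The affordance term vetoes the LLM's top choice here, which is the whole point: language preferences are grounded in what the robot can actually do.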

This line is not a replacement for VLA. It sits at a different level:

  • VLA is closer to the unified low-level policy
  • LLM agents are closer to the high-level planner

See LLM-Driven Robotics.


8. Key Technical Turning Points

The following turning points largely define recent evolution:

| Turning Point | Problem Solved | Representative Work |
| --- | --- | --- |
| Action discretization | Let LLMs/VLMs emit action tokens | RT-1, RT-2, OpenVLA |
| Action chunking | Reduce jitter and improve temporal consistency | ACT, pi0 |
| Diffusion / Flow Matching | Model multimodal continuous action distributions | Diffusion Policy, RDT, pi0 |
| Web pretraining | Import semantic knowledge and reasoning | RT-2, OpenVLA, pi0 |
| Cross-embodiment | One model across multiple robots | Octo, HPT, OpenVLA |
| Open-source training stack | Lower reproduction and fine-tuning cost | Octo, OpenVLA, LeRobot |
| High-frequency action tokenization | Make token-based control viable at higher rates | FAST, newer tokenized policies |
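The first turning point, action discretization, is mechanically simple: clip each continuous action dimension to a range and map it to one of a fixed number of bins, so a language-model head can emit actions as ordinary tokens. The bin count and ranges below are illustrative, not the exact RT-2 settings.

```python
# Discretize continuous actions into token ids and map them back.

def discretize(action, low=-1.0, high=1.0, bins=256):
    tokens = []
    for a in action:
        a = min(max(a, low), high)                    # clip to range
        tokens.append(int((a - low) / (high - low) * (bins - 1) + 0.5))
    return tokens

def undiscretize(tokens, low=-1.0, high=1.0, bins=256):
    return [low + t / (bins - 1) * (high - low) for t in tokens]

tokens = discretize([0.0, -1.0, 0.5])     # e.g. gripper dx, dy, rotation
recovered = undiscretize(tokens)          # close to the original, within one bin
```

The quantization error is at most half a bin width, which with 256 bins over [-1, 1] is under 0.004 per dimension.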

The historical role of ACT can be summarized in one sentence:

It is not the endpoint of robot foundation models, but it turned short-horizon action chunks into a clear modeling paradigm.


9. 2022-2026 Timeline Overview

```mermaid
timeline
    title Robot Model Roadmap (2022-2026)
    2022 : RT-1
         : SayCan
         : LLM planning starts entering robotics
    2023 : RT-2
         : Octo
         : ACT
         : Diffusion Policy
         : Sequence modeling, generative policies, and VLA all accelerate together
    2024 : OpenVLA
         : pi0
         : HPT
         : RDT-1B
         : Genie / Cosmos / Genesis directions heat up
    2025 : pi0.5
         : FAST
         : Hierarchical VLA and action tokenization continue to evolve
    2026 : Current focus
         : Better deployability
         : Faster action representations
         : Fusion of VLA + world model + agents
```

The last line is a trend summary as of 2026-04, not a list of one-off paper titles.


10. Paradigm Tree and What to Read First

```mermaid
graph TD
    ROOT[Embodied AI models] --> IL[Classic imitation learning]
    ROOT --> SEQ[Sequence and generative policies]
    ROOT --> VLA[VLA / robot foundation models]
    ROOT --> WM[World models]
    ROOT --> PLAN[LLM planning]

    IL --> BC[BC / DAgger / GAIL]
    SEQ --> ACT[ACT]
    SEQ --> DP[Diffusion Policy]
    VLA --> RT[RT-1 / RT-2]
    VLA --> OCTO[Octo / OpenVLA / HPT]
    VLA --> PI[pi0 / RDT]
    WM --> DREAMER[Dreamer / UniSim / Cosmos]
    PLAN --> SAYCAN[SayCan / Code as Policies / VoxPoser]

    ACT --> VLA

    style ACT fill:#e8f5e9
    style VLA fill:#e3f2fd
    style WM fill:#fff3e0
```

| Your question | Read first | Then read |
| --- | --- | --- |
| I do not know how robot models are organized | Model Roadmap | Foundation Models for Robotics |
| I only care about mainstream VLAs | VLA Models | Open-Source Model Summary |
| I want to understand why ACT keeps coming up | ACT Model | Imitation Learning |
| I work on bimanual or dexterous manipulation | ACT Model | Diffusion Policy |
| I care about world models and generated simulation | World Models & Video Generation | Simulation World Building & Physics Rules |
| I care about open-source reproduction and training entrypoints | Open-Source Model Summary | Open-Source Frameworks |

11. Which Models Deserve Standalone Pages

A model should usually get its own page only if it satisfies at least one of these:

  • it opens a clear technical branch
  • it is still frequently used in engineering practice
  • it is a bridge for understanding later models

Using that rule, the current recommendation is:

| Model / Direction | Recommendation |
| --- | --- |
| ACT | Already promoted to a standalone page because its bridge role is unusually clear |
| VLA mainline | Already has its own page because the family and timeline are rich enough |
| Diffusion Policy | Keep in Diffusion Policy, because it still fits better as a method branch |
| RT-2 / OpenVLA / pi0 | Keep inside the VLA overview for now; split later if the content grows further |
| World models | Keep as a direction page first; no need to explode into many single-model pages yet |
