# Model Roadmap
Embodied AI models did not evolve along a single line. They advanced along two axes at the same time:
- a timeline axis: from classic imitation learning to sequence modeling, then VLAs, world models, and agentic planning
- a paradigm axis: from "learn actions" to "learn action distributions," then to "unify vision, language, and action," and finally to "learn a world and plan inside it"
This note does not replace single-model notes. Its job is to provide the missing map for the whole 05_Models section. After reading it, you should know:
- what stages robot models have gone through
- where ACT sits in the overall lineage
- which branches VLA, Diffusion Policy, world models, and LLM planning belong to
- what to read next
Related notes: Foundation Models for Robotics | VLA Models | ACT Model | LLM-Driven Robotics | World Models & Video Generation
## 1. Why a Model Roadmap Is Needed
If you only memorize models by publication year, two mistakes are common:
- assuming newer models fully replace older ones
- assuming all works compete on the same technical line
In practice:
- ACT is not a foundation model, but it is a key bridge on the action chunking line
- Diffusion Policy is not a VLA, but it defines a major branch of generative action modeling
- RT-2 / OpenVLA / pi0 belong to the VLA mainline, but use different action heads
- SayCan / Code as Policies mainly solve planning and interfaces, not low-level action generation
- Dreamer / UniSim / Cosmos are closer to world modeling and training in imagination
So the right organization is not just chronology and not just taxonomy. You need timeline + paradigm tree together.
## 2. Two Axes: Timeline and Paradigm Tree
### 2.1 What the timeline tells you
The timeline answers: when did the key turning points happen, and which later ideas inherit from them?
### 2.2 What the paradigm tree tells you
The paradigm tree answers: which layer of the robotics stack a model is mainly trying to solve.
| Dimension | Core Question | Representative Models |
|---|---|---|
| Classic policy learning | How do we learn actions from demonstrations? | BC, DAgger, GAIL |
| Sequence / generative policies | How do we model multi-step, multimodal actions? | Decision Transformer, BeT, ACT, Diffusion Policy |
| VLA / robot foundation models | How do we unify vision, language, and action? | RT-1, RT-2, Octo, OpenVLA, pi0 |
| World models | How do we predict futures and train inside a model? | Dreamer, UniSim, Genie, Cosmos |
| LLM planning | How do we do long-horizon reasoning, task decomposition, and tool use? | SayCan, Code as Policies, VoxPoser |
## 3. Stage One: Classic Robot Learning Models
The core goal here was simple: teach the robot to imitate actions from data.
Typical representatives:
- BC: cast policy learning as supervised learning
- DAgger: fix distribution shift at deployment time
- IRL / MaxEnt IRL: infer rewards from demonstrations
- GAIL: formulate imitation learning as adversarial training
The signature of this stage:
- usually narrow tasks
- relatively small datasets
- mostly single-step action mapping
- weak language and weak cross-task generalization
This material mainly lives in Imitation Learning.
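The BC framing above can be made concrete in a few lines: policy learning reduces to supervised regression over (state, action) demonstration pairs. The following toy sketch (entirely synthetic, with a hypothetical linear expert) recovers the expert policy by plain least squares:

```python
import numpy as np

# Toy behavior cloning: treat policy learning as supervised regression
# from observed states to expert actions. All data here is synthetic.
rng = np.random.default_rng(0)
expert_W = np.array([[1.0, -0.5],
                     [0.3,  2.0]])          # hidden expert policy a = W s
states = rng.normal(size=(200, 2))          # demonstration observations
actions = states @ expert_W.T               # noiseless expert actions

# BC objective: minimize ||S W^T - A||^2 over the demonstration dataset
W_bc, *_ = np.linalg.lstsq(states, actions, rcond=None)
print(np.allclose(W_bc.T, expert_W, atol=1e-8))  # → True
```

The sketch also shows why the stage's weaknesses follow directly from the formulation: the regression only sees states the expert visited, which is exactly the distribution-shift gap DAgger closes by re-querying the expert on states the learned policy reaches.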
## 4. Stage Two: Sequence Modeling and Generative Policies
Once researchers realized that single-step action regression suffers from jitter, cannot represent multimodal action distributions, and accumulates error over long horizons, a second stage emerged.
### 4.1 The core shift
From: predict the single best action for the current step
To: predict an action sequence, or a full distribution over actions
In other words, models stopped predicting only the current action and started predicting an action sequence, or a full action distribution.
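The practical difference shows up in how often the policy is queried during a rollout: a chunking policy is called once per k steps instead of once per step. A minimal sketch (function names and the dummy policy/environment are made up for illustration):

```python
def rollout_with_chunks(policy, obs, env_step, horizon, chunk_size):
    """Execute `horizon` steps, querying the policy once per chunk."""
    actions, calls = [], 0
    while len(actions) < horizon:
        chunk = policy(obs)[:chunk_size]   # one forward pass -> k actions
        calls += 1
        for a in chunk:
            obs = env_step(obs, a)
            actions.append(a)
            if len(actions) == horizon:
                break
    return actions, calls

# Dummy policy and environment, just to count queries: 8 steps, chunks of 4
acts, calls = rollout_with_chunks(lambda o: [0.0] * 4,
                                  0.0, lambda o, a: o + a,
                                  horizon=8, chunk_size=4)
print(calls)  # → 2 (a single-step policy would need 8 queries)
```

Fewer queries per episode is one reason chunking reduces jitter: consecutive actions come from the same forward pass rather than from independent predictions.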
### 4.2 Representative models
- Decision Transformer: control as sequence modeling
- BeT: behavior modeling with discrete latent action tokens
- ACT: CVAE + Transformer for action chunk prediction
- Diffusion Policy: diffusion models for multimodal action distributions
### 4.3 Why ACT is a bridge node
ACT matters not because it is the biggest model, but because it made the following line explicit:
```mermaid
graph LR
    IL[Classic imitation learning] --> CHUNK[Action chunking]
    CHUNK --> ACT[ACT]
    ACT --> GEN[Generative action modeling]
    GEN --> DP[Diffusion Policy / RDT]
    ACT --> VLA_HEAD[Later VLA chunk / token / horizon design]
    VLA_HEAD --> OPENVLA[OpenVLA / pi0 / FAST]
    style ACT fill:#e8f5e9
    style CHUNK fill:#fff3e0
    style VLA_HEAD fill:#e3f2fd
```
So ACT links:
- the left side: Imitation Learning
- the right side: larger action-modeling lines such as VLA Models and later action tokenization work
If you read only VLA papers without understanding ACT, you miss a large part of the motivation behind chunked actions.
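One concrete idea chunking enables, which ACT made explicit, is temporal ensembling at inference time: several overlapping chunks predict the same timestep, and their predictions are averaged with exponential weights. A minimal scalar-action sketch, assuming the ACT paper's weighting w_j = exp(-m*j) with the oldest prediction first (the function name is mine):

```python
import numpy as np

def temporal_ensemble(chunks, t, chunk_size, m=0.1):
    """Average every chunk prediction that covers timestep t.

    chunks[i] is the action chunk predicted at timestep i; a chunk
    predicted at i covers timesteps i .. i + chunk_size - 1.
    """
    preds = [chunk[t - i] for i, chunk in enumerate(chunks)
             if i <= t < i + chunk_size]
    w = np.exp(-m * np.arange(len(preds)))   # oldest prediction weighted most
    return float(np.dot(w, preds) / w.sum())

# Two overlapping chunks disagree about timestep 1; the ensemble blends them.
a = temporal_ensemble([[1.0, 1.0], [3.0, 3.0]], t=1, chunk_size=2)
print(round(a, 3))  # → 1.95
```

The decay rate m trades smoothness against reactivity: larger m trusts the newest chunk more, smaller m keeps the executed trajectory closer to earlier commitments.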
## 5. Stage Three: VLAs and Robot Foundation Models
The central question became:
> Can we unify vision, language, and action in one model, and borrow web-scale pretraining for robot generalization?
### 5.1 Mainline representatives
- RT-1: large-scale real-robot Robotics Transformer
- RT-2: VLM-to-VLA transfer from web knowledge
- Octo: open-source, multi-embodiment, unified action interface
- OpenVLA: open 7B VLA for community reproduction
- HPT: unifying heterogeneous sensor inputs
- RDT: diffusion Transformer for high-dimensional bimanual action generation
- pi0: a flow-matching-style VLA mainline
### 5.2 What changed in this stage
| Change | Earlier policy models | VLA / foundation-model stage |
|---|---|---|
| Inputs | state, limited images | multi-view vision + language + proprioception |
| Data | single task or single robot | multi-task, multi-robot, multi-source |
| Objective | learn a specific skill | learn a general vision-language-action mapping |
| Output | single-step or short-horizon action | token / chunk / diffusion / flow |
| Source of generalization | demo coverage | pretraining + cross-embodiment data + fine-tuning |
For details, see Foundation Models for Robotics and VLA Models.
## 6. Stage Four: World Models and Training in Imagination
As action models became stronger, the research focus moved one level up:
> If a model can predict not only actions but also what will happen next, can we trial-and-error inside the model first?
This branch includes:
- Dreamer family: train policies inside a latent world model
- UniSim: replace large amounts of real interaction with generated worlds
- Genie / interactive video world models: learn controllable dynamics from video
- Cosmos / Genesis: combine large-scale video generation, simulation, and physical AI data
The emphasis is no longer only "give me an action," but also:
- predict future visual states
- predict latent dynamics
- evaluate or train policies with imagined rollouts
The matching note is World Models & Video Generation.
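"Trial-and-error inside the model" has a simple operational core: roll a policy forward through learned dynamics and reward heads, never touching the real environment. A Dreamer-style toy sketch, with hand-written scalar stand-ins for the learned components (all names and values are illustrative):

```python
def imagined_return(z0, policy, dynamics, reward, horizon):
    """Score a policy purely inside a (learned) latent model."""
    z, total = z0, 0.0
    for _ in range(horizon):
        a = policy(z)
        total += reward(z, a)
        z = dynamics(z, a)          # predicted next latent, no real env step
    return total

# Stand-ins: the latent is a scalar, dynamics decays it, reward is -z^2,
# and the "policy" pushes the latent toward zero.
ret = imagined_return(z0=2.0,
                      policy=lambda z: -0.5 * z,
                      dynamics=lambda z, a: 0.5 * z + a,
                      reward=lambda z, a: -z * z,
                      horizon=3)
print(ret)  # → -4.0
```

In a real world model the dynamics, reward, and policy are neural networks and the return's gradient (or a value estimate over imagined rollouts) drives policy improvement; the loop structure is the same.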
## 7. Stage Five: LLM Planning and Embodied Agents
There is another parallel line that does not primarily optimize low-level actions. Instead, it studies:
> How can large models do long-horizon task understanding, tool use, decomposition, and interface orchestration?
Representative works:
- SayCan: combine LLM likelihood with affordance scores
- Code as Policies: generate code that calls robot APIs
- VoxPoser: use 3D value maps and spatial representations for manipulation
This line is not a replacement for VLA. It sits at a different level:
- VLA is closer to the unified low-level policy
- LLM agents are closer to the high-level planner
See LLM-Driven Robotics.
## 8. Key Technical Turning Points
The following turning points largely define recent evolution:
| Turning Point | Problem Solved | Representative Work |
|---|---|---|
| Action discretization | Let LLMs/VLMs emit action tokens | RT-1, RT-2, OpenVLA |
| Action chunking | Reduce jitter and improve temporal consistency | ACT, pi0 |
| Diffusion / Flow Matching | Model multimodal continuous action distributions | Diffusion Policy, RDT, pi0 |
| Web pretraining | Import semantic knowledge and reasoning | RT-2, OpenVLA, pi0 |
| Cross-embodiment | One model across multiple robots | Octo, HPT, OpenVLA |
| Open-source training stack | Lower reproduction and fine-tuning cost | Octo, OpenVLA, LeRobot |
| High-frequency action tokenization | Make token-based control viable at higher rates | FAST, newer tokenized policies |
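The first turning point in the table, action discretization, is in its simplest form just uniform binning: each continuous action dimension is clipped to a range and mapped to one of N token ids (RT-style models commonly use 256 bins). A minimal sketch, with the bin count and function names as assumptions:

```python
import numpy as np

N_BINS = 256  # a common choice in RT-style models

def to_token(action, low, high, n_bins=N_BINS):
    """Map a continuous action in [low, high] to an integer token id."""
    frac = (np.clip(action, low, high) - low) / (high - low)
    return min(int(frac * n_bins), n_bins - 1)

def to_action(token, low, high, n_bins=N_BINS):
    """Decode a token id back to the center of its bin."""
    return low + (token + 0.5) / n_bins * (high - low)

# Round-trip error is bounded by half a bin width.
tok = to_token(0.123, low=-1.0, high=1.0)
print(abs(to_action(tok, -1.0, 1.0) - 0.123) <= 1.0 / N_BINS)  # → True
```

The quantization error this introduces at high control rates is exactly what the last row of the table targets: FAST-style tokenization compresses action sequences instead of binning each step independently.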
The historical role of ACT can be summarized in one sentence:
> It is not the endpoint of robot foundation models, but it turned short-horizon action chunks into a clear modeling paradigm.
## 9. 2022-2026 Timeline Overview
```mermaid
timeline
    title Robot Model Roadmap (2022-2026)
    2022 : RT-1
         : SayCan
         : LLM planning starts entering robotics
    2023 : RT-2
         : Octo
         : ACT
         : Diffusion Policy
         : Sequence modeling, generative policies, and VLA all accelerate together
    2024 : OpenVLA
         : pi0
         : HPT
         : RDT-1B
         : Genie / Cosmos / Genesis directions heat up
    2025 : pi0.5
         : FAST
         : Hierarchical VLA and action tokenization continue to evolve
    2026 : Current focus
         : Better deployability
         : Faster action representations
         : Fusion of VLA + world model + agents
```
The last line is a trend summary as of 2026-04, not a list of one-off paper titles.
## 10. Paradigm Tree and What to Read First
```mermaid
graph TD
    ROOT[Embodied AI models] --> IL[Classic imitation learning]
    ROOT --> SEQ[Sequence and generative policies]
    ROOT --> VLA[VLA / robot foundation models]
    ROOT --> WM[World models]
    ROOT --> PLAN[LLM planning]
    IL --> BC[BC / DAgger / GAIL]
    SEQ --> ACT[ACT]
    SEQ --> DP[Diffusion Policy]
    VLA --> RT[RT-1 / RT-2]
    VLA --> OCTO[Octo / OpenVLA / HPT]
    VLA --> PI[pi0 / RDT]
    WM --> DREAMER[Dreamer / UniSim / Cosmos]
    PLAN --> SAYCAN[SayCan / Code as Policies / VoxPoser]
    ACT --> VLA
    style ACT fill:#e8f5e9
    style VLA fill:#e3f2fd
    style WM fill:#fff3e0
```
### Recommended reading order
| Your question | Read first | Then read |
|---|---|---|
| I do not know how robot models are organized | Model Roadmap | Foundation Models for Robotics |
| I only care about mainstream VLAs | VLA Models | Open-Source Model Summary |
| I want to understand why ACT keeps coming up | ACT Model | Imitation Learning |
| I work on bimanual or dexterous manipulation | ACT Model | Diffusion Policy |
| I care about world models and generated simulation | World Models & Video Generation | Simulation World Building & Physics Rules |
| I care about open-source reproduction and training entrypoints | Open-Source Model Summary | Open-Source Frameworks |
## 11. Which Models Deserve Standalone Pages
A model should usually get its own page only if it satisfies at least one of these:
- it opens a clear technical branch
- it is still frequently used in engineering practice
- it is a bridge for understanding later models
Using that rule, the current recommendation is:
| Model / Direction | Recommendation |
|---|---|
| ACT | Already promoted to a standalone page because its bridge role is unusually clear |
| VLA mainline | Already has its own page because the family and timeline are rich enough |
| Diffusion Policy | Keep in Diffusion Policy, because it still fits better as a method branch |
| RT-2 / OpenVLA / pi0 | Keep inside the VLA overview for now; split later if the content grows further |
| World models | Keep as a direction page first; no need to explode into many single-model pages yet |
## 12. References
- Brohan et al., RT-1: Robotics Transformer for Real-World Control at Scale, RSS 2023
- Brohan et al., RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, CoRL 2023
- Octo Team, Octo: An Open-Source Generalist Robot Policy, RSS 2024
- Kim et al., OpenVLA: An Open-Source Vision-Language-Action Model, 2024
- Zhao et al., Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, RSS 2023
- Chi et al., Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, RSS 2023
- Black et al., pi0: A Vision-Language-Action Flow Model for General Robot Control, 2024
- Physical Intelligence, FAST: Efficient Robot Action Tokenization, 2025
- Hafner et al., DreamerV3, 2023
- Official pages:
- OpenVLA: https://openvla.github.io/
- Octo: https://octo-models.github.io/
- Tony Zhao / ALOHA + ACT: https://tonyzhaozh.github.io/
- pi0 PDF: https://www.physicalintelligence.company/download/pi0.pdf
- FAST: https://www.physicalintelligence.company/research/fast