
ACT Model

ACT (Action Chunking with Transformers) is one of the most representative models in embodied intelligence. It is not the biggest or newest foundation model, but it turned action chunking into a clear paradigm, which is why it still matters for understanding later VLA action heads, generative policies, and low-cost bimanual manipulation.

This note treats ACT not as "a subsection inside imitation learning," but as a model-level milestone: what problem it solved, why it mattered in 2023, and how it relates to Diffusion Policy, OpenVLA, and pi0.

Related notes: Model Roadmap | Imitation Learning | VLA Models | Open-Source Model Summary | Teleoperation and Data Collection


1. What ACT Is and Why It Deserves Its Own Page

One-sentence definition:

Instead of predicting a single action for the current step, ACT predicts a short future action chunk and executes it with temporal smoothing.

Formally, it rewrites classic behavioral cloning from

\[ \pi(o_t) \rightarrow a_t \]

to

\[ \pi(o_t, s_t) \rightarrow \hat{\mathbf{a}}_{t:t+H-1} \]

where:

  • \(o_t\) is the visual observation
  • \(s_t\) is the robot state
  • \(H\) is the chunk horizon
  • the output is a sequence of future actions rather than one action

It deserves a standalone note for three reasons:

  1. Its historical position is unusual: it is the bridge from classic imitation learning to chunk-based policies.
  2. Its engineering value is strong: few demonstrations, low-cost hardware, and bimanual precision tasks still make it highly reproducible.
  3. It influenced later model design: many later VLA, tokenized-action, and fast-inference lines keep the same "predict a chunk, not a single step" intuition.

2. Problem Setting

ACT was built for a very specific but difficult class of problems:

  • fine-grained bimanual manipulation
  • few demonstrations
  • low-cost hardware, especially ALOHA-style bimanual teleoperation
  • high-frequency control with temporal consistency

The main weaknesses of standard single-step BC are:

| Problem | Manifestation |
|---|---|
| Single-step jitter | Each action is predicted independently, making trajectories noisy |
| Multimodality | Multiple valid action paths may exist for the same observation |
| Long-horizon accumulation | Small step errors snowball over time |
| Hard bimanual coordination | The two arms need both temporal continuity and spatial synchronization |

ACT answers with:

  • CVAE for multimodality
  • action chunking for short-horizon prediction
  • temporal ensembling for smoothing overlapping predictions

3. Core Ideas

3.1 Action Chunking

Instead of predicting only \(a_t\), ACT predicts a short-horizon action block:

\[ \hat{\mathbf{a}}_{t:t+H-1} = \left(\hat{a}_t, \hat{a}_{t+1}, \ldots, \hat{a}_{t+H-1}\right) \]

This has three direct benefits:

  1. the model explicitly learns short-term temporal structure
  2. inference calls are reduced
  3. actions become more temporally coherent
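The "fewer inference calls" benefit is easy to see in a toy rollout. The sketch below is illustrative, not the real ACT policy: `policy` is a stand-in that returns a fake chunk, and the horizon `H = 8` is an assumed value.

```python
import numpy as np

H = 8  # chunk horizon (illustrative value; ACT often uses much longer chunks)

def policy(obs):
    """Stand-in for the ACT policy: one call returns a whole chunk of H
    future actions. A real policy would run the transformer decoder here."""
    return np.tile(obs, (H, 1))          # shape (H, action_dim)

def rollout_chunked(states):
    """Execute whole chunks, so the policy is queried once per H steps."""
    actions, calls, t = [], 0, 0
    while t < len(states):
        chunk = policy(states[t])
        calls += 1
        actions.extend(chunk[: len(states) - t])
        t += H
    return np.array(actions), calls

states = np.random.randn(64, 7)          # 64 control steps, 7-D state
acts, calls = rollout_chunked(states)    # 8 policy calls instead of 64
```

With single-step BC the same rollout would cost 64 policy calls; chunking reduces it to 64 / H.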

3.2 Temporal Ensembling

The policy predicts a fresh chunk at every control step, so a given time step typically has multiple overlapping predictions made at different past times. ACT fuses them with exponentially weighted averaging:

\[ a_t^{\text{exec}} = \frac{\sum_i w_i \hat{a}_t^{(i)}}{\sum_i w_i}, \quad w_i = \exp(-m\, i) \]

where \(i = 0\) indexes the oldest prediction. In the original ACT implementation the oldest predictions receive the largest weights; the decay rate \(m\) controls how quickly newer observations are incorporated, with smaller \(m\) incorporating them faster.
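A minimal sketch of this fusion, assuming predictions are stored oldest first and using \(m = 0.01\) (the default decay rate in the reference ACT code, to the best of my knowledge):

```python
import numpy as np

def temporal_ensemble(preds, m=0.01):
    """Fuse k overlapping predictions of the same action a_t.
    preds: (k, action_dim), ordered oldest first (index i = 0 is oldest).
    m: decay rate; 0.01 is assumed here as the reference-code default."""
    preds = np.asarray(preds, dtype=float)
    w = np.exp(-m * np.arange(len(preds)))   # w_0 (oldest) is the largest
    return (w[:, None] * preds).sum(axis=0) / w.sum()

# three overlapping predictions of a_t, made 2, 1, and 0 steps ago
a_exec = temporal_ensemble([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]])
```

Setting \(m = 0\) recovers a plain average; larger \(m\) makes the executed action stickier to older predictions.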

3.3 CVAE Latent

ACT does not use purely deterministic regression. It introduces a latent variable \(z\) to encode an action style or mode:

\[ z \sim q_\phi(z \mid o_t, s_t, \mathbf{a}_{t:t+H-1}) \]

During training, the encoder sees the observation and the future action chunk and compresses that mode into the latent; during inference, the model falls back to a stable prior-based latent and outputs a chunk directly.
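This training/inference asymmetry in the latent can be sketched in a few lines. Only the sampling logic is shown; the encoder that would produce `mu` and `logvar` from the observation and expert chunk is omitted, and the latent size is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, logvar, training):
    """Latent sampling with the CVAE asymmetry ACT relies on.
    Training: reparameterization trick, z = mu + sigma * eps.
    Inference: no future expert actions exist, so fall back to the
    prior mean z = 0 as a deterministic default."""
    if training:
        eps = rng.standard_normal(mu.shape)
        return mu + np.exp(0.5 * logvar) * eps
    return np.zeros_like(mu)

mu, logvar = np.ones(32), np.zeros(32)          # toy posterior parameters
z_train = sample_z(mu, logvar, training=True)   # stochastic, style-conditioned
z_test = sample_z(mu, logvar, training=False)   # stable default style
```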


4. Model Architecture

4.1 Overall structure

```mermaid
graph LR
    subgraph Inputs["Inputs"]
        IMG[Multi-view images] --> ENC
        STATE[Robot state] --> ENC
    end

    subgraph TrainEncoder["CVAE encoder during training"]
        GT[Future action chunk] --> LAT
        ENC --> LAT
        LAT["Infer q(z | obs, actions)"] --> Z[z]
    end

    subgraph Decoder["Transformer decoder"]
        ENC[Vision + state encoding] --> TF[Transformer]
        Z --> TF
        TF --> CHUNK[Predicted action chunk]
    end

    style Inputs fill:#e3f2fd
    style TrainEncoder fill:#fff3e0
    style Decoder fill:#e8f5e9
```

4.2 Inputs and outputs

Typical ACT inputs include:

  • multi-view RGB images
  • arm joint state or end-effector state
  • gripper state

The output is:

  • a future action chunk of length \(H\)
  • often bimanual joint or end-effector control commands
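As a concrete shape sketch: the batch layout below is illustrative, with assumed values for batch size, horizon, and image resolution; the 14-dimensional state/action follows the common ALOHA convention of two arms with 6 joints plus 1 gripper each.

```python
import numpy as np

B, H = 8, 100                      # batch size, chunk horizon (assumed values)
n_cams, img_h, img_w = 4, 480, 640 # assumed camera count and resolution
state_dim = action_dim = 14        # 2 arms x (6 joints + 1 gripper), ALOHA-style

batch = {
    "images": np.zeros((B, n_cams, 3, img_h, img_w), dtype=np.float32),
    "state": np.zeros((B, state_dim), dtype=np.float32),
    # training target: the future expert action chunk
    "actions": np.zeros((B, H, action_dim), dtype=np.float32),
}
```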

4.3 Division of labor

| Component | Role |
|---|---|
| Visual encoder | Extract features from multiple camera views |
| State encoder | Represent proprioception |
| CVAE encoder | During training, infer a latent style variable from the future expert actions |
| Transformer decoder | Produce a future action chunk conditioned on observation and latent |

5. Mathematical Objective

5.1 Chunk prediction

ACT learns:

\[ p_\theta(\mathbf{a}_{t:t+H-1} \mid o_t, s_t, z) \]

instead of only:

\[ p_\theta(a_t \mid o_t) \]

5.2 Reconstruction loss

The decoded chunk should match the expert future chunk. The original ACT uses an L1 loss, which the authors report gives more precise action modeling than the more common L2:

\[ \mathcal{L}_{\text{recon}} = \sum_{j=0}^{H-1} \left\| \hat{a}_{t+j} - a^*_{t+j} \right\|_1 \]

5.3 KL regularization

The CVAE posterior is kept close to a standard normal prior:

\[ \mathcal{L}_{\text{KL}} = D_{\text{KL}}\left(q_\phi(z \mid o_t, s_t, \mathbf{a}^*) \,\|\, \mathcal{N}(0, I)\right) \]

5.4 Total loss

The training objective is typically:

\[ \mathcal{L}_{\text{ACT}} = \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{KL}} \]

where \(\beta\) controls the strength of the latent regularization.
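Putting the two terms together, a minimal numerical sketch of the objective follows. The L1 reconstruction term and \(\beta = 10\) follow the reference ACT code as I understand it (both are assumptions, not prescriptions), and the KL term uses the standard closed form for a diagonal Gaussian against \(\mathcal{N}(0, I)\).

```python
import numpy as np

def act_loss(pred_chunk, expert_chunk, mu, logvar, beta=10.0):
    """L_ACT = L_recon + beta * L_KL.
    beta = 10 is assumed as the reference-code kl_weight default; the
    reference implementation uses an L1 reconstruction term, used here too."""
    recon = np.mean(np.abs(pred_chunk - expert_chunk))
    # closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return recon + beta * kl

chunk = np.zeros((16, 14))   # H = 16 steps of 14-D bimanual actions
loss = act_loss(chunk, chunk, mu=np.zeros(8), logvar=np.zeros(8))
```

A perfect reconstruction with a posterior exactly matching the prior gives zero loss, which is a quick sanity check when wiring up training.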


6. Inference Procedure

Training and inference are not symmetric in ACT, which is one of the most important things to understand about it.

```mermaid
graph TD
    T1[Training] --> T2[Observation + future expert actions]
    T2 --> T3[CVAE encoder infers z]
    T3 --> T4[Transformer outputs chunk]
    T4 --> T5[Reconstruction loss + KL]

    I1[Inference] --> I2[Current observation only]
    I2 --> I3[No future expert actions available]
    I3 --> I4[Use prior / fixed latent]
    I4 --> I5[Transformer outputs chunk]
    I5 --> I6[Temporal ensembling for smooth execution]

    style T1 fill:#fff3e0
    style I1 fill:#e8f5e9
```

6.1 Why inference often uses z = 0

Because training pulls the posterior toward a standard normal prior, inference can use the prior mean as a stable default:

\[ z_{\text{test}} = 0 \]

This gives the model a deterministic and stable default action style at test time.

6.2 Why temporal ensembling is still needed

Even with chunk prediction, contact-rich tasks still suffer from vision noise, slight state mismatch, and occlusions. Overlapping chunk fusion helps reduce:

  • jitter
  • arm desynchronization
  • small oscillations near contact boundaries

7. Why ACT Mattered in 2023

7.1 It showed that few high-quality demos plus the right structure can solve precision tasks

What made ACT memorable was not parameter count, but that it achieved strong success on fine-grained bimanual tasks on ALOHA with roughly 50 demonstrations.

7.2 It turned chunk-based policy learning into a clear line

After ACT, predicting a short future chunk stopped feeling exotic. More and more later work treated the following as natural:

  • predict several future actions at once
  • treat time horizon as an explicit design variable
  • represent actions as chunks, tokens, or generated trajectories

7.3 Together with ALOHA, it formed a low-cost embodied research stack

ACT's impact came not only from the paper, but from the fact that it sat inside a practical loop of low-cost hardware, teleoperation, demonstrations, and code. For many labs, it became one of the first bimanual manipulation systems they could realistically reproduce.


8. Limitations of ACT

ACT is important, but its boundaries are equally important:

| Limitation | Explanation |
|---|---|
| Narrower task scope | Best suited for tabletop, short-horizon manipulation-heavy tasks |
| Limited generalization | It is not a web-scale pretrained foundation model |
| Sensitive to the data distribution | Camera placement, action definitions, and teleop style matter a lot |
| Weak semantics | It does not inherit strong language or world knowledge like a VLA |
| Still requires control tuning | Chunk size, replanning frequency, and temporal weights matter |

9. Relationship to Neighboring Methods

9.1 ACT vs BC

  • BC usually predicts a single step
  • ACT predicts an action chunk and models multimodality with a latent variable

9.2 ACT vs Diffusion Policy

  • ACT uses CVAE + Transformer and is usually lighter at inference
  • Diffusion Policy models richer action distributions but pays more inference cost

9.3 ACT vs BeT / tokenized actions

  • ACT predicts continuous action chunks
  • BeT and later tokenizers push harder on turning actions into discrete tokens for Transformer-native modeling

9.4 ACT vs VLA / OpenVLA / pi0 / FAST

| Dimension | ACT | Diffusion Policy | OpenVLA | pi0 |
|---|---|---|---|---|
| Main role | Fine manipulation policy | Generative manipulation policy | Open VLA | Flow-based VLA |
| Inputs | image + state | image + state | image + language + state | image + language + state |
| Output | continuous action chunk | diffusion-generated action sequence | discrete action tokens | continuous chunk / flow |
| Language ability | weak | weak | strong | strong |
| Typical data scale | tens to hundreds of demos | medium offline imitation datasets | close to one million episodes | large multi-robot data |
| Core value | establishes the chunked policy paradigm | stronger multimodal action modeling | reproducible VLA mainline | stronger continuous action generation and cross-embodiment transfer |

The key relationship is not "who replaced whom," but:

  • ACT made chunked action prediction mainstream
  • Diffusion Policy pushed generative action modeling further
  • OpenVLA and pi0 moved action modeling into larger vision-language-action foundation models

10. Engineering Considerations

10.1 Data organization

ACT depends heavily on high-quality demonstrations. In practice you need:

  • synchronized multi-camera streams
  • tight alignment between robot state and image frames
  • a consistent bimanual action definition
  • stable teleoperation style

10.2 Key hyperparameters

| Hyperparameter | Purpose | If increased | If decreased |
|---|---|---|---|
| chunk_size / horizon | predicted future length | stronger short-term planning, more stability demands | closer to single-step control |
| beta | KL regularization strength | cleaner latent space, possible underfitting | more flexibility, more overfitting risk |
| batch_size | optimization stability | smoother gradients, more memory | noisier updates |
| camera_views | visual coverage | fewer occlusions, more engineering overhead | easier setup, less information |
| replan interval | how often to predict a new chunk | more responsive, more compute | cheaper, less adaptive |
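One convenient way to keep these knobs explicit is a small config object. The defaults below are illustrative starting points (e.g. `chunk_size = 100` and `kl_weight = 10` are commonly seen defaults around the reference ACT code), not prescriptions; `replan_every` is a hypothetical name for the replan interval.

```python
from dataclasses import dataclass

@dataclass
class ACTConfig:
    """Illustrative hyperparameter bundle; values are assumed common
    starting points, not prescriptions."""
    chunk_size: int = 100       # predicted horizon H
    kl_weight: float = 10.0     # beta in L_ACT = L_recon + beta * L_KL
    batch_size: int = 8
    camera_views: int = 4
    replan_every: int = 1       # predict a new chunk every step for ensembling

cfg = ACTConfig(chunk_size=50)  # shorter horizon for a more reactive policy
```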

10.3 Training cost

One practical strength of ACT is that it is much cheaper to train than 7B-scale VLAs. It is a strong choice as:

  • a first baseline for small labs doing bimanual precision manipulation
  • the first model on top of a LeRobot / ALOHA-style data stack
  • a diagnosis tool for separating data problems from model problems

11. Good and Bad Use Cases

Better fit

  • bimanual tabletop precision tasks
  • short- to medium-horizon manipulation
  • small but high-quality demonstration datasets
  • tasks where temporal consistency matters but diffusion inference is too expensive

Worse fit

  • strong language understanding and open-vocabulary instructions
  • direct zero-shot transfer across many robot platforms
  • long-horizon high-level planning
  • open-world foundation-model ambitions

12. Open-Source Ecosystem and Reproduction

A major reason ACT still deserves attention is that it still has a clear open-source path:

  • ALOHA / project page: low-cost bimanual teleoperation + ACT
  • official code repo: tonyzhaozh/act
  • LeRobot: keeps ACT as an important baseline inside a unified training framework

A practical engineering path is often:

  1. validate the pipeline first with LeRobot or simulation
  2. move to real ALOHA-style bimanual collection
  3. only then decide whether to upgrade to Diffusion Policy or a larger VLA

13. Why It Matters but Is Not the Endpoint

ACT's historical value is that it made "the action chunk" into a first-class prediction object;
but it did not solve the biggest foundation-model-era problems: large-scale cross-embodiment pretraining, strong language generalization, or open-world long-horizon planning.

So the most accurate positioning is:

  • not the endpoint
  • not an obsolete baseline
  • a bridge

If you are building a mental model of the field, read it together with the related notes linked at the top: Model Roadmap, Imitation Learning, and VLA Models.


14. References

  • Zhao et al., Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, RSS 2023
  • ACT repository: https://github.com/tonyzhaozh/act
  • Tony Zhao / ALOHA + ACT project page: https://tonyzhaozh.github.io/
  • LeRobot ACT docs: https://huggingface.co/docs/lerobot/act
  • Chi et al., Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, RSS 2023
  • Kim et al., OpenVLA: An Open-Source Vision-Language-Action Model, 2024
  • Black et al., pi0: A Vision-Language-Action Flow Model for General Robot Control, 2024
  • Physical Intelligence, FAST: Efficient Robot Action Tokenization, 2025
