
Imitation Learning

Overview

Imitation Learning (IL) is one of the most practical paradigms in robot learning: given an expert demonstration dataset \(\mathcal{D} = \{(o_1, a_1^*), (o_2, a_2^*), \ldots, (o_N, a_N^*)\}\), the goal is to learn a policy \(\pi_\theta: \mathcal{O} \rightarrow \mathcal{A}\) that enables the robot to reproduce expert behavior.

Unlike reinforcement learning, imitation learning does not require a reward function, significantly lowering the barrier for task design. However, it faces unique challenges: distribution shift, multimodal action distributions, and data acquisition cost.


Behavioral Cloning (BC)

Basic Formulation

Behavioral Cloning (BC) reduces imitation learning to a standard supervised learning problem. Given an expert dataset \(\mathcal{D} = \{(\mathbf{o}_i, \mathbf{a}_i^*)\}_{i=1}^N\), the objective is to minimize the loss between the policy's predictions and expert actions:

\[ \mathcal{L}_{\text{BC}} = \mathbb{E}_{(\mathbf{o}, \mathbf{a}^*) \sim \mathcal{D}} \left[ \| \pi_\theta(\mathbf{o}) - \mathbf{a}^* \|^2 \right] \]

For a deterministic policy, this is equivalent to least-squares regression. For a stochastic policy \(\pi_\theta(\mathbf{a}|\mathbf{o})\), the objective becomes maximizing the log-likelihood:

\[ \mathcal{L}_{\text{BC}}^{\text{NLL}} = -\mathbb{E}_{(\mathbf{o}, \mathbf{a}^*) \sim \mathcal{D}} \left[ \log \pi_\theta(\mathbf{a}^* | \mathbf{o}) \right] \]
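
As a concrete illustration, below is a minimal PyTorch sketch of the MSE variant; the network architecture, dimensions, and hyperparameters are placeholders rather than a recommended setup.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: a flat observation vector and a 7-DoF action.
OBS_DIM, ACT_DIM = 64, 7

# A deterministic policy pi_theta(o) -> a, trained as plain regression.
policy = nn.Sequential(
    nn.Linear(OBS_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, ACT_DIM),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def bc_step(obs: torch.Tensor, expert_action: torch.Tensor) -> float:
    """One gradient step on L_BC = E[ ||pi_theta(o) - a*||^2 ]."""
    loss = ((policy(obs) - expert_action) ** 2).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with a dummy batch standing in for (o, a*) ~ D.
print(bc_step(torch.randn(32, OBS_DIM), torch.randn(32, ACT_DIM)))
```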

Gaussian Policy

Assuming the action follows a conditional Gaussian distribution \(\pi_\theta(\mathbf{a}|\mathbf{o}) = \mathcal{N}(\mu_\theta(\mathbf{o}), \Sigma_\theta(\mathbf{o}))\), the negative log-likelihood is:

\[ -\log \pi_\theta(\mathbf{a}^*|\mathbf{o}) = \frac{1}{2}(\mathbf{a}^* - \mu_\theta)^\top \Sigma_\theta^{-1} (\mathbf{a}^* - \mu_\theta) + \frac{1}{2}\log|\Sigma_\theta| + \frac{d}{2}\log(2\pi) \]

When \(\Sigma_\theta = \sigma^2 I\) is fixed, this reduces to the MSE loss up to an additive constant and a scale factor.
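
The same objective can be sketched for a diagonal-Gaussian policy; the following PyTorch snippet is illustrative only, with placeholder dimensions and architecture.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 64, 7  # hypothetical dimensions

class GaussianPolicy(nn.Module):
    """pi_theta(a|o) = N(mu_theta(o), diag(sigma_theta(o)^2))."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, ACT_DIM)
        self.log_std_head = nn.Linear(256, ACT_DIM)

    def forward(self, obs):
        h = self.backbone(obs)
        return self.mu_head(h), self.log_std_head(h)

def gaussian_nll(policy, obs, expert_action):
    mu, log_std = policy(obs)
    dist = torch.distributions.Normal(mu, log_std.exp())
    # Negative log-likelihood, summed over action dimensions.
    return -dist.log_prob(expert_action).sum(dim=-1).mean()

policy = GaussianPolicy()
loss = gaussian_nll(policy, torch.randn(8, OBS_DIM), torch.randn(8, ACT_DIM))
# If log_std is frozen at a constant, this loss equals (1 / 2*sigma^2) * MSE plus a
# constant, which is why the Gaussian NLL reduces to the MSE objective.
```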

Input Representation

BC policy inputs typically include:

| Input Type | Representation | Example Dimension |
| --- | --- | --- |
| Joint positions | \(q \in \mathbb{R}^{n}\) | 7-DOF arm: \(\mathbb{R}^7\) |
| Joint velocities | \(\dot{q} \in \mathbb{R}^{n}\) | \(\mathbb{R}^7\) |
| End-effector pose | \((p, R) \in SE(3)\) | \(\mathbb{R}^{7}\) (position + quaternion) |
| RGB image | \(I \in \mathbb{R}^{H \times W \times 3}\) | \(224 \times 224 \times 3\) |
| Depth map | \(D \in \mathbb{R}^{H \times W}\) | \(224 \times 224\) |
| Point cloud | \(P \in \mathbb{R}^{N \times 3}\) | \(N = 1024\) |
| Language instruction | Token embedding | \(\mathbb{R}^{512}\) |
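
To make the table concrete, here is a hedged sketch of how such inputs might be fused into a single policy feature; the encoder modules and dimensions are stand-ins, not the architecture of any particular method.

```python
import torch
import torch.nn as nn

class ObsEncoder(nn.Module):
    """Fuses proprioception (q, q_dot, EE pose), an RGB image, and a language
    embedding into one feature vector for the policy head (dimensions illustrative)."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.proprio_mlp = nn.Linear(7 + 7 + 7, 128)
        self.image_cnn = nn.Sequential(          # stand-in for a ResNet backbone
            nn.Conv2d(3, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128),
        )
        self.lang_proj = nn.Linear(512, 128)
        self.fuse = nn.Linear(128 * 3, feat_dim)

    def forward(self, proprio, rgb, lang):
        feats = torch.cat([
            self.proprio_mlp(proprio),
            self.image_cnn(rgb),
            self.lang_proj(lang),
        ], dim=-1)
        return self.fuse(feats)

enc = ObsEncoder()
feat = enc(torch.randn(1, 21), torch.randn(1, 3, 224, 224), torch.randn(1, 512))
print(feat.shape)  # torch.Size([1, 512])
```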

Distribution Shift Problem

Theoretical Analysis

The core deficiency of BC is distribution shift, also known as covariate shift. During training, the policy learns on states from the expert distribution \(d_{\pi^*}\); during deployment, small deviations lead the policy to unseen states, causing error accumulation.

Theorem (Ross & Bagnell, 2010): Suppose the policy \(\pi_\theta\) makes an error with probability at most \(\epsilon\) at each step under the expert's state distribution (i.e., \(\mathbb{E}_{o \sim d_{\pi^*}}[\mathbb{1}[\pi_\theta(o) \neq \pi^*(o)]] \leq \epsilon\)). Then over a trajectory of length \(T\), the performance gap of the BC policy is bounded by:

\[ J(\pi_\theta) - J(\pi^*) \leq \epsilon T^2 \]

That is, the error grows quadratically with the time horizon. This means that for long-horizon tasks (e.g., \(T = 100\)), even with per-step error \(\epsilon = 0.01\), the total error can reach \(100\).

Intuitive Understanding

Consider a simple walking-in-a-straight-line task:

  1. The expert always walks in a straight line, so all training data comes from states on the line
  2. The BC policy learns "when on the line, keep going forward"
  3. During deployment, a slight perturbation takes the robot off the line
  4. Off the line, the policy has seen no training data, and its output is essentially arbitrary
  5. Arbitrary actions push the robot further off course, and errors cascade (see the toy simulation below)
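
The toy simulation below makes this concrete: a policy that matches the expert only inside the band of states seen during training drifts as soon as noise pushes it outside that band. All constants are arbitrary and chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(y):
    """The expert corrects any lateral offset y and keeps walking straight."""
    return -y

def bc_action(y, seen_band=0.1):
    """BC policy: behaves like the expert only inside the band of states covered
    by training data; outside it, its output is modeled as arbitrary noise."""
    return -y if abs(y) <= seen_band else rng.normal(scale=0.5)

def rollout(policy, T=100, noise=0.05):
    """Accumulated lateral deviation over T steps with small execution noise."""
    y, total = 0.0, 0.0
    for _ in range(T):
        y = y + policy(y) + rng.normal(scale=noise)
        total += abs(y)
    return total

print("expert total deviation:", rollout(expert_action))
print("BC     total deviation:", rollout(bc_action))
```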

DAgger Algorithm

Algorithm Framework

DAgger (Dataset Aggregation, Ross et al., 2011) addresses distribution shift by querying the expert on states visited by the policy itself:

Algorithm:

  1. Initialize: Train initial policy \(\pi_0\) on expert data \(\mathcal{D}_0\)
  2. For \(i = 1, 2, \ldots, N\):
    • Collect trajectories \(\tau_i\) using current policy \(\pi_i\)
    • For each state \(o_t\) in the trajectory, query the expert label \(a_t^* = \pi^*(o_t)\)
    • Aggregate dataset: \(\mathcal{D}_{i+1} = \mathcal{D}_i \cup \{(o_t, a_t^*)\}\)
    • Retrain policy \(\pi_{i+1}\) on \(\mathcal{D}_{i+1}\)

Theoretical Guarantee: DAgger reduces the error bound from \(O(\epsilon T^2)\) to \(O(\epsilon T)\) (linear growth).
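
A minimal sketch of the aggregation loop is shown below. The `env`, `expert`, and `train_policy` interfaces are assumed and simplified for illustration; they are not a real library API.

```python
def dagger(env, expert, train_policy, n_iters=10, episode_len=200):
    """DAgger sketch: roll out the current policy, label visited states with the
    expert, aggregate, and retrain (all interfaces are assumed placeholders)."""
    dataset = []   # aggregated (observation, expert_action) pairs
    policy = None  # on the first iteration we roll out the expert itself

    for _ in range(n_iters):
        obs = env.reset()
        for _ in range(episode_len):
            # Act with the current policy, but record the expert's label
            # for every state the policy actually visits.
            action = expert(obs) if policy is None else policy(obs)
            dataset.append((obs, expert(obs)))
            obs, done = env.step(action)
            if done:
                obs = env.reset()
        # Retrain on the aggregated dataset D_{i+1} (standard supervised BC).
        policy = train_policy(dataset)

    return policy
```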

Limitations

DAgger requires online expert queries, which is costly in robotics:

  • Requires a human operator to provide real-time annotations
  • The interaction process is time-consuming and hard to scale
  • In some states, even the expert struggles to provide optimal actions

Inverse Reinforcement Learning (IRL)

Problem Definition

Inverse Reinforcement Learning (IRL) does not directly learn a policy; instead, it infers the reward function \(r(s, a)\) from expert demonstrations, then solves for the optimal policy using standard RL.

Motivation: Reward functions are more compact and transferable than policies. A learned reward function can adapt to new environment dynamics.

Maximum Entropy IRL (MaxEntIRL)

Core Assumption: Among all trajectory distributions consistent with the expert's (feature) expectations, the expert is modeled by the one with maximum entropy; equivalently, a trajectory's probability is exponential in its cumulative reward.

Derivation: Define the trajectory reward as \(R(\tau) = \sum_{t=0}^T r(s_t, a_t)\), and the maximum entropy distribution as:

\[ p(\tau) = \frac{1}{Z} \exp(R(\tau)) \]

where the partition function \(Z = \int \exp(R(\tau)) d\tau\).

The objective is to maximize the log-likelihood of expert trajectories:

\[ \max_r \mathbb{E}_{\tau \sim \mathcal{D}} [\log p(\tau)] = \max_r \mathbb{E}_{\tau \sim \mathcal{D}} \left[ R(\tau) \right] - \log Z \]

Gradient Computation:

\[ \nabla_r \mathcal{L} = \mathbb{E}_{\tau \sim \mathcal{D}} \left[ \nabla_r R(\tau) \right] - \mathbb{E}_{\tau \sim p(\tau)} \left[ \nabla_r R(\tau) \right] \]

That is, the gradient is the expert's feature expectation minus the feature expectation under the current model distribution \(p(\tau)\). This is the classic feature matching condition.
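
For a linear reward \(r_\theta(s, a) = \theta^\top \phi(s, a)\), this gradient is literally a difference of feature expectations. A hedged NumPy sketch follows; sampling from the soft-optimal policy (the expensive inner loop) is assumed to happen elsewhere.

```python
import numpy as np

def feature_expectation(trajectories, phi):
    """Mean cumulative feature vector over trajectories.
    Each trajectory is a list of (s, a) pairs; phi maps (s, a) to a feature vector."""
    sums = [np.sum([phi(s, a) for s, a in traj], axis=0) for traj in trajectories]
    return np.mean(sums, axis=0)

def maxent_irl_step(theta, expert_trajs, policy_trajs, phi, lr=0.1):
    """One gradient-ascent step on the MaxEnt IRL likelihood with a linear reward
    r_theta(s, a) = theta . phi(s, a). `policy_trajs` are assumed to be sampled
    from the current soft-optimal policy under r_theta."""
    grad = feature_expectation(expert_trajs, phi) - feature_expectation(policy_trajs, phi)
    return theta + lr * grad
```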

Practical Difficulties of IRL

  1. Inner-loop optimization: After each reward update, the RL problem must be re-solved (computationally expensive)
  2. Reward ambiguity: Multiple reward functions can explain the same set of demonstrations
  3. High-dimensional problems: Feature engineering is challenging

GAIL: Generative Adversarial Imitation Learning

Adversarial Framework

GAIL (Generative Adversarial Imitation Learning, Ho & Ermon, 2016) elegantly bypasses the inner-loop optimization of IRL by casting imitation learning as an adversarial game.

Objective Function:

\[ \min_\pi \max_D \mathbb{E}_{\pi}[\log D(s, a)] + \mathbb{E}_{\pi^*}[\log(1 - D(s, a))] - \lambda H(\pi) \]

where:

  • \(D(s, a)\): the discriminator, distinguishing between policy-generated and expert \((s, a)\) pairs
  • \(\pi\): the generator (policy), attempting to fool the discriminator
  • \(H(\pi)\): an entropy regularization term on the policy, encouraging exploration

Training Procedure:

  1. Collect trajectories using current policy \(\pi\)
  2. Train discriminator \(D\) to distinguish between policy and expert data
  3. Use \(-\log D(s, a)\) as reward and update the policy with RL (e.g., TRPO/PPO)
  4. Repeat steps 1–3
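
Steps 2–3 can be sketched as follows; this is an illustrative PyTorch fragment with placeholder dimensions, and the RL policy update itself (TRPO/PPO) is omitted.

```python
import torch
import torch.nn as nn

STATE_DIM, ACT_DIM = 16, 4  # hypothetical dimensions

# Discriminator D(s, a): probability that the (s, a) pair came from the *policy*,
# matching the sign convention of the objective above.
disc = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, 128), nn.ReLU(), nn.Linear(128, 1))
disc_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(policy_sa, expert_sa):
    """Train D to output 1 on policy pairs and 0 on expert pairs."""
    logits_pi, logits_exp = disc(policy_sa), disc(expert_sa)
    loss = bce(logits_pi, torch.ones_like(logits_pi)) + \
           bce(logits_exp, torch.zeros_like(logits_exp))
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()
    return loss.item()

def gail_reward(sa):
    """Reward r(s, a) = -log D(s, a), handed to the RL algorithm as a surrogate reward."""
    with torch.no_grad():
        return -torch.log(torch.sigmoid(disc(sa)) + 1e-8)
```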

Correspondence with GANs

| GAN Component | GAIL Counterpart |
| --- | --- |
| Generator \(G\) | Policy \(\pi_\theta\) |
| Discriminator \(D\) | Reward function \(r(s,a) = -\log D(s,a)\) |
| Generated samples | Policy trajectories \((s, a) \sim \pi\) |
| Real samples | Expert trajectories \((s, a) \sim \pi^*\) |
| Backpropagation | RL policy gradient |

Theoretical Connection of GAIL

Theorem (Ho & Ermon, 2016): When the discriminator is optimal, GAIL minimizes the Jensen-Shannon divergence between the policy occupancy measure and the expert occupancy measure:

\[ \text{GAIL} \Leftrightarrow \min_\pi D_{\text{JS}}(\rho_\pi \| \rho_{\pi^*}) \]

where the occupancy measure \(\rho_\pi(s, a) = \pi(a|s) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)\).


ACT: Action Chunking with Transformers

ACT is the representative bridge from classic imitation learning to chunk-based policies. This page keeps it as a method entry; for the fuller model-level explanation, read ACT Model.

Why it still belongs to the imitation-learning line

ACT is still standard offline imitation learning: it fits a policy directly from demonstrations, without a reward function and without online RL. Its key innovation is not a new training paradigm but a change in the prediction target of BC: instead of a single action, the policy outputs a future action chunk.

Core structure

graph LR
    OBS[Observation o_t + state s_t] --> ENC[Encoder]
    GT[Future action chunk a_t:t+H-1] --> CVAE[CVAE encoder]
    ENC --> DEC[Transformer decoder]
    CVAE --> Z[z]
    Z --> DEC
    DEC --> CHUNK[Predicted action chunk]

    style ENC fill:#e3f2fd
    style CVAE fill:#fff3e0
    style DEC fill:#e8f5e9

Key ideas

  1. Action chunking: predict the next \(H\) actions at once instead of only \(a_t\).
  2. CVAE latent: model multiple valid action styles under the same observation.
  3. Temporal ensembling: fuse overlapping chunks at execution time to reduce jitter.

The chunk prediction can be written as:

\[ \hat{\mathbf{a}}_{t:t+H-1} = \pi_\theta(o_t, s_t) \]

The training objective is typically a reconstruction term plus KL regularization:

\[ \mathcal{L}_{\text{ACT}} = \mathcal{L}_{\text{recon}} + \beta D_{\text{KL}}(q_\phi(z|o, s, \mathbf{a}^*) \| \mathcal{N}(0, I)) \]
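
Below is a minimal sketch of that objective, using an L1 reconstruction term as in the ACT paper; the tensor shapes and \(\beta\) value are illustrative, and the encoder/decoder producing these tensors is omitted.

```python
import torch
import torch.nn.functional as F

def act_loss(pred_chunk, expert_chunk, z_mu, z_logvar, beta=10.0):
    """ACT-style objective: reconstruct the expert action chunk and regularize the
    CVAE posterior q(z | o, s, a*) toward N(0, I).

    pred_chunk, expert_chunk: (batch, H, act_dim) predicted / ground-truth chunks
    z_mu, z_logvar:           (batch, z_dim) parameters of the CVAE posterior
    """
    recon = F.l1_loss(pred_chunk, expert_chunk)
    kl = -0.5 * torch.mean(torch.sum(1 + z_logvar - z_mu.pow(2) - z_logvar.exp(), dim=-1))
    return recon + beta * kl
```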

Why it matters in the imitation-learning lineage

ACT showed that:

  • a small number of high-quality demonstrations can suffice for fine-grained bimanual manipulation tasks
  • imitation learning does not have to stay with single-step action regression
  • the action chunk can become a natural representation for later, larger models

If you want the full story of how ACT influenced later VLAs and action tokenization work, continue with ACT Model and Model Roadmap.


Method Comparison

| Method | Data Requirements | Online Interaction | Multimodality | Long-horizon | Typical Application |
| --- | --- | --- | --- | --- | --- |
| BC | Offline demos | No | Poor | Poor | Rapid prototyping |
| DAgger | Offline + online | Yes | Poor | Medium | Autonomous driving |
| IRL/MaxEntIRL | Offline demos | Yes (RL) | Good | Good | Reward learning |
| GAIL | Offline demos | Yes (RL) | Good | Good | Motion imitation |
| ACT | Offline demos | No | Good (CVAE) | Medium | Dexterous manipulation |
| Diffusion Policy | Offline demos | No | Excellent | Good | Contact-rich tasks |

Beyond the methods compared above, several directions carry the imitation-learning line forward:

  1. Diffusion Policy: Using denoising diffusion models to model multimodal action distributions, see Diffusion Policy
  2. VLA Models: Combining vision-language foundation models with action prediction, see VLA Models
  3. Data Scaling: Large-scale data collection through better teleoperation systems, see Teleoperation and Data Collection
  4. Cross-embodiment Transfer: Transferring imitation learning policies across different robot morphologies

References

  1. Pomerleau, D. (1989). ALVINN: An Autonomous Land Vehicle in a Neural Network. NeurIPS.
  2. Ross, S., Gordon, G., & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS.
  3. Ziebart, B., et al. (2008). Maximum Entropy Inverse Reinforcement Learning. AAAI.
  4. Ho, J. & Ermon, S. (2016). Generative Adversarial Imitation Learning. NeurIPS.
  5. Zhao, T., et al. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. RSS.
  6. Chi, C., et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS.
