Imitation Learning
Overview
Imitation Learning (IL) is one of the most practical paradigms in robot learning: given an expert demonstration dataset \(\mathcal{D} = \{(o_1, a_1^*), (o_2, a_2^*), \ldots, (o_N, a_N^*)\}\), the goal is to learn a policy \(\pi_\theta: \mathcal{O} \rightarrow \mathcal{A}\) that enables the robot to reproduce expert behavior.
Unlike reinforcement learning, imitation learning does not require a reward function, significantly lowering the barrier for task design. However, it faces unique challenges: distribution shift, multimodal action distributions, and data acquisition cost.
Behavioral Cloning (BC)
Basic Formulation
Behavioral Cloning (BC) reduces imitation learning to a standard supervised learning problem. Given an expert dataset \(\mathcal{D} = \{(\mathbf{o}_i, \mathbf{a}_i^*)\}_{i=1}^N\), the objective is to minimize the loss between the policy's predictions and the expert actions:
\[
\min_\theta \; \frac{1}{N} \sum_{i=1}^N \ell\big(\pi_\theta(\mathbf{o}_i), \mathbf{a}_i^*\big)
\]
For a deterministic policy with \(\ell\) the squared error, this is least-squares regression. For a stochastic policy \(\pi_\theta(\mathbf{a}|\mathbf{o})\), the objective becomes maximizing the log-likelihood:
\[
\max_\theta \; \sum_{i=1}^N \log \pi_\theta(\mathbf{a}_i^* \mid \mathbf{o}_i)
\]
Gaussian Policy
Assuming the action follows a conditional Gaussian distribution \(\pi_\theta(\mathbf{a}|\mathbf{o}) = \mathcal{N}(\mu_\theta(\mathbf{o}), \Sigma_\theta(\mathbf{o}))\), the negative log-likelihood is:
\[
-\log \pi_\theta(\mathbf{a}|\mathbf{o}) = \frac{1}{2} \big(\mathbf{a} - \mu_\theta(\mathbf{o})\big)^\top \Sigma_\theta(\mathbf{o})^{-1} \big(\mathbf{a} - \mu_\theta(\mathbf{o})\big) + \frac{1}{2} \log \det \Sigma_\theta(\mathbf{o}) + \text{const}
\]
When \(\Sigma_\theta = \sigma^2 I\) is held fixed, this reduces to the MSE loss up to scale and an additive constant.
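With a fixed \(\sigma\), BC on a linear policy collapses to ordinary least squares. A minimal NumPy sketch, using a made-up linear expert \(\mathbf{a}^* = -K\mathbf{o}\) (the gain matrix and noise level are illustrative, not from any real robot):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expert: linear feedback a* = -K o (K_true is a made-up gain matrix).
K_true = np.array([[1.5, 0.4]])
obs = rng.normal(size=(500, 2))                            # observations o_i
acts = -obs @ K_true.T + 0.01 * rng.normal(size=(500, 1))  # noisy expert labels a_i*

# BC with a deterministic linear policy + MSE loss = ordinary least squares.
K_bc, *_ = np.linalg.lstsq(obs, acts, rcond=None)

mse = np.mean((obs @ K_bc - acts) ** 2)
print(K_bc.T)   # recovers approximately -K_true
```

In practice the linear map is replaced by a neural network and the closed-form solve by SGD, but the objective is the same regression.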
Input Representation
BC policy inputs typically include:
| Input Type | Representation | Dimension Example |
|---|---|---|
| Joint positions | \(q \in \mathbb{R}^{n}\) | 7-DOF arm: \(\mathbb{R}^7\) |
| Joint velocities | \(\dot{q} \in \mathbb{R}^{n}\) | \(\mathbb{R}^7\) |
| End-effector pose | \((p, R) \in SE(3)\) | \(\mathbb{R}^{7}\) (position + quaternion) |
| RGB image | \(I \in \mathbb{R}^{H \times W \times 3}\) | \(224 \times 224 \times 3\) |
| Depth map | \(D \in \mathbb{R}^{H \times W}\) | \(224 \times 224\) |
| Point cloud | \(P \in \mathbb{R}^{N \times 3}\) | \(N = 1024\) |
| Language instruction | Token embedding | \(\mathbb{R}^{512}\) |
Distribution Shift Problem
Theoretical Analysis
The core deficiency of BC is distribution shift, also known as covariate shift. During training, the policy learns on states from the expert distribution \(d_{\pi^*}\); during deployment, small deviations lead the policy to unseen states, causing error accumulation.
Theorem (Ross & Bagnell, 2010): Suppose the policy \(\pi_\theta\) makes an error with probability at most \(\epsilon\) at each step under the expert's state distribution (i.e., \(\mathbb{E}_{o \sim d_{\pi^*}}[\mathbb{1}[\pi_\theta(o) \neq \pi^*(o)]] \leq \epsilon\)). Then over a trajectory of length \(T\), the expected total cost of the BC policy satisfies:
\[
J(\pi_\theta) \leq J(\pi^*) + \epsilon T^2
\]
That is, the error bound grows quadratically with the time horizon. For a long-horizon task with \(T = 100\), even a per-step error of \(\epsilon = 0.01\) gives a bound of \(\epsilon T^2 = 100\).
Intuitive Understanding
Consider a simple walking-in-a-straight-line task:
- The expert always walks in a straight line; all training data comes from states on the line
- The BC policy learns "when on the line, keep going forward"
- During deployment, a slight deviation takes the robot off the line
- Off the line, the policy has never seen any data and outputs randomly
- Random actions cause larger deviations, leading to cascading error accumulation
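The compounding effect can be checked numerically. The sketch below makes the pessimistic assumption that the policy slips off the expert manifold with probability \(\epsilon\) per on-distribution step and never recovers (it has no data there); the resulting cost tracks the \(O(\epsilon T^2)\) bound:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_cost(T, eps, trials=20_000):
    """Monte Carlo estimate of off-distribution steps for a BC policy that,
    with probability eps per on-distribution step, slips off the expert
    manifold and (pessimistically) never recovers."""
    first_err = rng.geometric(eps, size=trials)    # step of the first slip
    return np.clip(T - first_err + 1, 0, None).mean()

for T in (25, 50, 100):
    # Empirical cost vs. the O(eps * T^2) compounding-error bound.
    print(T, rollout_cost(T, 0.01), 0.01 * T**2 / 2)
```

Doubling the horizon roughly quadruples the expected off-distribution cost, which is the quadratic growth the theorem predicts.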
DAgger Algorithm
Algorithm Framework
DAgger (Dataset Aggregation, Ross et al., 2011) addresses distribution shift by querying the expert on states visited by the policy itself:
Algorithm:
- Initialize: Train initial policy \(\pi_0\) on expert data \(\mathcal{D}_0\)
- For \(i = 1, 2, \ldots, N\):
  - Collect trajectories \(\tau_i\) using the current policy \(\pi_i\)
  - For each state \(o_t\) in the trajectory, query the expert label \(a_t^* = \pi^*(o_t)\)
  - Aggregate dataset: \(\mathcal{D}_{i+1} = \mathcal{D}_i \cup \{(o_t, a_t^*)\}\)
  - Retrain policy \(\pi_{i+1}\) on \(\mathcal{D}_{i+1}\)
Theoretical Guarantee: DAgger reduces the error bound from \(O(\epsilon T^2)\) to \(O(\epsilon T)\) (linear growth).
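The loop above can be sketched on a toy 1D system (the dynamics, expert gain, and noise levels are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D system s' = s + a + noise, with a hypothetical expert a* = -0.8 s.
def expert(s):
    return -0.8 * s

def rollout(k, T=30):
    """Run the linear policy a = k*s and record the states it visits."""
    s, states = 2.0, []
    for _ in range(T):
        states.append(s)
        s = s + k * s + rng.normal(scale=0.05)
    return np.array(states)

def fit(S, A):
    """Least-squares fit of a = k*s on the aggregated dataset."""
    return float(S @ A / (S @ S))

S = rollout(-0.8)                                  # D_0: the expert's own states
A = expert(S) + 0.02 * rng.normal(size=len(S))     # noisy expert labels
k = fit(S, A)
for _ in range(5):                                 # DAgger iterations
    S_i = rollout(k)                               # roll out the *learner*
    A_i = expert(S_i) + 0.02 * rng.normal(size=len(S_i))  # query expert on its states
    S, A = np.concatenate([S, S_i]), np.concatenate([A, A_i])
    k = fit(S, A)                                  # retrain on aggregated data
print(k)   # close to -0.8
```

The essential difference from BC is in the rollout line: data is collected under the learner's own state distribution, then labeled by the expert.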
Limitations
DAgger requires online expert queries, which is costly in robotics:
- Requires a human operator to provide real-time annotations
- The interaction process is time-consuming and hard to scale
- In some states, even the expert struggles to provide optimal actions
Inverse Reinforcement Learning (IRL)
Problem Definition
Inverse Reinforcement Learning (IRL) does not directly learn a policy; instead, it infers the reward function \(r(s, a)\) from expert demonstrations, then solves for the optimal policy using standard RL.
Motivation: Reward functions are more compact and transferable than policies. A learned reward function can adapt to new environment dynamics.
Maximum Entropy IRL (MaxEntIRL)
Core Assumption: Among all trajectory distributions consistent with the demonstrated behavior (matching its feature expectations), the expert is modeled by the one with maximum entropy; equivalently, a trajectory's probability is proportional to its exponentiated reward.
Derivation: Define the trajectory reward as \(R(\tau) = \sum_{t=0}^T r(s_t, a_t)\); the maximum entropy distribution over trajectories is then:
\[
p(\tau) = \frac{1}{Z} \exp\big(R(\tau)\big)
\]
where the partition function \(Z = \int \exp(R(\tau)) d\tau\).
The objective is to maximize the log-likelihood of the \(M\) expert trajectories:
\[
\mathcal{L}(\theta) = \frac{1}{M} \sum_{i=1}^{M} R_\theta(\tau_i) - \log Z
\]
Gradient Computation:
\[
\nabla_\theta \mathcal{L} = \mathbb{E}_{\tau \sim \pi^*}\big[\nabla_\theta R_\theta(\tau)\big] - \mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta R_\theta(\tau)\big]
\]
For a linear reward \(R_\theta(\tau) = \theta^\top f(\tau)\), this is the feature expectation of the expert minus the feature expectation under the current model distribution: the classic feature matching condition.
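On a problem small enough to enumerate every trajectory, this gradient can be computed exactly. A sketch with a made-up 3-step binary-action world and a linear reward (the features, weights, and demo indices are all illustrative):

```python
import numpy as np
from itertools import product

# Tiny world: a binary action at each of T=3 steps gives 8 enumerable trajectories.
# Hypothetical feature f(tau) = the action taken at each step.
trajs = np.array(list(product([0, 1], repeat=3)), dtype=float)   # (8, 3)

theta = np.array([0.5, -0.2, 0.1])      # linear reward R(tau) = theta @ f(tau)

# Model distribution p(tau) = exp(R(tau)) / Z with an explicit partition function.
logits = trajs @ theta
p = np.exp(logits - logits.max())
p /= p.sum()

# Expert feature expectation from a handful of demonstrated trajectories.
demos = trajs[[7, 7, 5, 3]]             # demo indices chosen for illustration
grad = demos.mean(axis=0) - p @ trajs   # = E_expert[f] - E_model[f]
print(grad)                             # ascend this to raise demo likelihood
```

In realistic MDPs the second expectation cannot be enumerated; it is estimated by solving the (soft) RL problem for the current reward, which is exactly the expensive inner loop discussed below.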
Practical Difficulties of IRL
- Inner-loop optimization: After each reward update, the RL problem must be re-solved (computationally expensive)
- Reward ambiguity: Multiple reward functions can explain the same set of demonstrations
- High-dimensional problems: Feature engineering is challenging
GAIL: Generative Adversarial Imitation Learning
Adversarial Framework
GAIL (Generative Adversarial Imitation Learning, Ho & Ermon, 2016) elegantly bypasses the inner-loop optimization of IRL by casting imitation learning as an adversarial game.
Objective Function:
\[
\min_\pi \max_D \; \mathbb{E}_{\pi}\big[\log D(s, a)\big] + \mathbb{E}_{\pi^*}\big[\log\big(1 - D(s, a)\big)\big] - \lambda H(\pi)
\]
where:
- \(D(s, a)\): discriminator, distinguishing policy-generated from expert \((s, a)\) pairs
- \(\pi\): generator (the policy), attempting to fool the discriminator
- \(H(\pi)\): entropy regularization term for the policy, encouraging exploration
Training Procedure:
- Collect trajectories using current policy \(\pi\)
- Train discriminator \(D\) to distinguish between policy and expert data
- Use \(-\log D(s, a)\) as reward and update the policy with RL (e.g., TRPO/PPO)
- Repeat steps 1–3
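Step 2 is ordinary binary classification. A self-contained sketch of the discriminator update and the induced reward, with synthetic Gaussian stand-ins for expert and policy \((s, a)\) pairs (the distributions and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (s, a) features: expert pairs centered at +1, policy pairs at -1.
expert_sa = rng.normal(loc=+1.0, size=(256, 2))
policy_sa = rng.normal(loc=-1.0, size=(256, 2))

# Targets follow the objective above: D -> 1 on policy data, D -> 0 on expert data.
X = np.hstack([np.vstack([policy_sa, expert_sa]), np.ones((512, 1))])  # + bias
y = np.concatenate([np.ones(256), np.zeros(256)])

w = np.zeros(3)
for _ in range(500):                        # logistic regression by gradient descent
    d = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (d - y) / len(y)

# GAIL-style policy reward r(s, a) = -log D(s, a): large only when the
# discriminator mistakes a policy sample for expert data.
D_policy = 1.0 / (1.0 + np.exp(-(policy_sa @ w[:2] + w[2])))
reward = -np.log(D_policy + 1e-8)
print(reward.mean())   # small here: the toy policy is easy to tell apart
```

In the full algorithm this reward feeds a policy-gradient update (step 3), so the policy is pushed toward the regions the discriminator cannot tell apart from expert data.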
Correspondence with GANs
| GAN Component | GAIL Counterpart |
|---|---|
| Generator \(G\) | Policy \(\pi_\theta\) |
| Discriminator \(D\) | Reward function \(r(s,a) = -\log D(s,a)\) |
| Generated samples | Policy trajectories \((s, a) \sim \pi\) |
| Real samples | Expert trajectories \((s, a) \sim \pi^*\) |
| Backpropagation | RL policy gradient |
Theoretical Connection of GAIL
Theorem (Ho & Ermon, 2016): When the discriminator is optimal, GAIL minimizes the Jensen-Shannon divergence between the policy occupancy measure and the expert occupancy measure:
\[
\min_\pi \; D_{\mathrm{JS}}\big(\rho_\pi \,\|\, \rho_{\pi^*}\big) - \lambda H(\pi)
\]
where the occupancy measure \(\rho_\pi(s, a) = \pi(a|s) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)\).
ACT: Action Chunking with Transformers
ACT is the representative bridge from classic imitation learning to chunk-based policies. This page keeps it as a method entry; for the fuller model-level explanation, read ACT Model.
Why it still belongs to the imitation-learning line
ACT is still standard offline imitation learning: it fits a policy directly to demonstrations, with no reward function and no online RL. Its key innovation is not a new training paradigm but a change in what BC outputs: a future action chunk instead of a single action.
Core structure
```mermaid
graph LR
    OBS[Observation o_t + state s_t] --> ENC[Encoder]
    GT[Future action chunk a_t:t+H-1] --> CVAE[CVAE encoder]
    ENC --> DEC[Transformer decoder]
    CVAE --> Z[z]
    Z --> DEC
    DEC --> CHUNK[Predicted action chunk]
    style ENC fill:#e3f2fd
    style CVAE fill:#fff3e0
    style DEC fill:#e8f5e9
```
Key ideas
- Action chunking: predict the next \(H\) actions at once instead of only \(a_t\).
- CVAE latent: model multiple valid action styles under the same observation.
- Temporal ensembling: fuse overlapping chunks at execution time to reduce jitter.
The chunk prediction can be written as:
\[
\hat{\mathbf{a}}_{t:t+H-1} = \pi_\theta(\mathbf{o}_t, z), \qquad z \sim q_\phi(z \mid \mathbf{a}_{t:t+H-1}, \mathbf{o}_t)
\]
The training objective is typically a reconstruction term plus KL regularization:
\[
\mathcal{L} = \big\| \hat{\mathbf{a}}_{t:t+H-1} - \mathbf{a}_{t:t+H-1} \big\|_1 + \beta \, D_{\mathrm{KL}}\big( q_\phi(z \mid \mathbf{a}_{t:t+H-1}, \mathbf{o}_t) \,\big\|\, \mathcal{N}(0, I) \big)
\]
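Temporal ensembling at execution time can be sketched as follows. The decay constant \(m\) and the toy chunks below are illustrative; the ACT paper weights the oldest prediction most heavily via \(w_i = \exp(-m \cdot i)\):

```python
import numpy as np

def temporal_ensemble(chunks, t, m=0.01):
    """Fuse every action that past chunks predicted for timestep t.

    chunks[k] is the H-step chunk predicted at timestep k (covering k..k+H-1).
    With weights w_i = exp(-m * i), the oldest prediction gets the largest
    weight, which smooths the executed trajectory.
    """
    H = len(chunks[0])
    preds, weights = [], []
    for i, k in enumerate(range(max(0, t - H + 1), t + 1)):
        if k < len(chunks):
            preds.append(chunks[k][t - k])   # what chunk k said about time t
            weights.append(np.exp(-m * i))   # i = 0 is the oldest prediction
    w = np.array(weights) / np.sum(weights)
    return w @ np.array(preds)

# Three noisy chunks (H = 3) predicting a scalar action sequence.
chunks = [np.array([0.0, 1.0, 2.0]),
          np.array([1.1, 2.1, 3.1]),
          np.array([1.9, 3.0, 4.0])]
print(temporal_ensemble(chunks, t=2))   # blends 2.0, 2.1 and 1.9
```

Real policies emit action vectors rather than scalars, but the fusion rule is the same applied per dimension.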
Why it matters in the imitation-learning lineage
ACT showed that:
- a small number of high-quality demonstrations can still solve fine bimanual tasks
- imitation learning does not have to stay with single-step regression
- the action chunk can become a natural representation for later, larger models
If you want the full story of how ACT influenced later VLAs and action tokenization work, continue with ACT Model and Model Roadmap.
Method Comparison
| Method | Data Requirements | Online Interaction | Multimodal | Long-horizon | Typical Application |
|---|---|---|---|---|---|
| BC | Offline demos | No | Poor | Poor | Rapid prototyping |
| DAgger | Offline + Online | Yes | Poor | Medium | Autonomous driving |
| IRL/MaxEntIRL | Offline demos | Yes (RL) | Good | Good | Reward learning |
| GAIL | Offline demos | Yes (RL) | Good | Good | Motion imitation |
| ACT | Offline demos | No | Good (CVAE) | Medium | Dexterous manipulation |
| Diffusion Policy | Offline demos | No | Excellent | Good | Contact-rich tasks |
Frontier Trends
- Diffusion Policy: Using denoising diffusion models to model multimodal action distributions, see Diffusion Policy
- VLA Models: Combining vision-language foundation models with action prediction, see VLA Models
- Data Scaling: Large-scale data collection through better teleoperation systems, see Teleoperation and Data Collection
- Cross-embodiment Transfer: Transferring imitation learning policies across different robot morphologies
References
- Pomerleau, D. (1989). ALVINN: An Autonomous Land Vehicle in a Neural Network. NeurIPS.
- Ross, S. & Bagnell, D. (2010). Efficient Reductions for Imitation Learning. AISTATS.
- Ross, S., Gordon, G., & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS.
- Ziebart, B., et al. (2008). Maximum Entropy Inverse Reinforcement Learning. AAAI.
- Ho, J. & Ermon, S. (2016). Generative Adversarial Imitation Learning. NeurIPS.
- Zhao, T., et al. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. RSS.
- Chi, C., et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS.