Diffusion Policy
Overview
Diffusion Policy (Chi et al., 2023) introduces Denoising Diffusion Probabilistic Models (DDPMs) into robot behavioral cloning, addressing the multimodal action-distribution problem that traditional BC struggles with. On contact-rich manipulation tasks, diffusion policies substantially outperform traditional behavioral-cloning baselines.
Background: Denoising Diffusion Probabilistic Models
DDPM Review
DDPM (Ho et al., 2020) defines two processes:
Forward process (adding noise): Gradually adds Gaussian noise to data \(x_0\):
\[ q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big) \]
where \(\beta_t\) is the noise schedule. Using the cumulative product \(\bar{\alpha}_t = \prod_{s=1}^t (1 - \beta_s)\), one can sample the noisy data at any time step \(t\) directly from \(x_0\):
\[ q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\big) \]
That is, \( x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \) with \( \epsilon \sim \mathcal{N}(0, I) \).
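The closed-form forward sampling can be checked numerically. A minimal NumPy sketch (the linear beta schedule here is an illustrative choice, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear noise schedule (an assumption, not from the paper).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)           # \bar{\alpha}_t

# Sample x_t at an arbitrary step t directly from x_0.
x0 = 1.0                                      # toy scalar "data"
t = 500
eps = rng.standard_normal(100_000)
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Empirical statistics should match N(sqrt(abar_t) * x0, 1 - abar_t).
print(xt.mean(), np.sqrt(alpha_bar[t]) * x0)  # means agree
print(xt.std(), np.sqrt(1.0 - alpha_bar[t]))  # stds agree
```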
Reverse process (denoising): Learns to gradually denoise from noise \(x_T \sim \mathcal{N}(0, I)\) back to data:
\[ p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big) \]
DDPM Training Objective
Through the simplified variational lower bound, the DDPM training objective reduces to noise prediction:
\[ \mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0, \epsilon, t}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right] \]
where \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\), and \(\epsilon_\theta\) is the noise prediction network.
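In code, the simplified objective is just an MSE between the true and predicted noise. A minimal sketch with a stand-in predictor (`eps_pred` is a placeholder for a real \(\epsilon_\theta\) network):

```python
import numpy as np

rng = np.random.default_rng(0)

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

def noisy_sample(x0, t, eps):
    """Forward-process sample x_t = sqrt(abar_t) x0 + sqrt(1-abar_t) eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def ddpm_loss(eps, eps_pred):
    """Simplified DDPM objective: MSE between true and predicted noise."""
    return np.mean((eps - eps_pred) ** 2)

# One training step on toy data with a placeholder "network".
x0 = rng.standard_normal(16)              # toy data vector
t = int(rng.integers(0, len(betas)))      # random diffusion step
eps = rng.standard_normal(16)             # the noise the network must predict
xt = noisy_sample(x0, t, eps)

eps_pred = np.zeros_like(xt)              # stand-in for eps_theta(xt, t)
loss = ddpm_loss(eps, eps_pred)           # nonzero for this trivial predictor
```

A perfect predictor (one that returns `eps` exactly) drives this loss to zero, which is why noise prediction is a valid training target.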
Derivation:
The reverse-process posterior of DDPM is \(q(x_{t-1} \mid x_t, x_0)\), with mean (writing \(\alpha_t = 1 - \beta_t\)):
\[ \tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t \]
Substituting \(x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon\right)\) gives the mean parameterized by \(\epsilon_\theta\):
\[ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) \]
Minimizing \(\|\tilde{\mu}_t - \mu_\theta\|^2\) is therefore equivalent, up to time-dependent coefficients, to minimizing \(\|\epsilon - \epsilon_\theta\|^2\).
DDIM Accelerated Sampling
DDPM requires \(T\) denoising steps (typically 1000), which is too slow for control. DDIM (Song et al., 2021) defines a non-Markovian sampling process that can use \(S \ll T\) steps:
\[ x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\; \epsilon_\theta(x_t, t) + \sigma_t \epsilon_t \]
where the first term is the predicted \(x_0\). When \(\sigma_t = 0\) the sampler is deterministic (DDIM); choosing \(\sigma_t\) to match the DDPM posterior variance recovers stochastic DDPM sampling.
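A single DDIM update can be implemented directly from the update rule. A sketch (schedule and shapes illustrative) that returns both the predicted \(x_0\) and the previous-step sample:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_step(xt, eps_hat, t, t_prev, sigma=0.0, noise=0.0):
    """One DDIM update from step t to t_prev (sigma=0 -> deterministic)."""
    # Predicted clean sample from the noise estimate.
    x0_pred = (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    # Direction pointing back toward x_t, plus optional fresh noise.
    dir_xt = np.sqrt(1.0 - alpha_bar[t_prev] - sigma**2) * eps_hat
    x_prev = np.sqrt(alpha_bar[t_prev]) * x0_pred + dir_xt + sigma * noise
    return x_prev, x0_pred

# Sanity check: if eps_hat equals the true forward-process noise,
# the clean sample x0 is recovered exactly.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
t, t_prev = 500, 450
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
x_prev, x0_pred = ddim_step(xt, eps, t, t_prev)
```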
Diffusion Policy Formulation
Problem Setup
Given the recent observations \(\mathbf{O}_t\) (images + robot state), predict the next \(T_a\) action steps:
\[ \mathbf{A}_t = (a_t, a_{t+1}, \ldots, a_{t+T_a-1}) \]
where \(\mathbf{O}_t = (o_{t-T_o+1}, \ldots, o_t)\) is the observation history over the last \(T_o\) steps.
Conditional Denoising
The action sequence \(\mathbf{A}_t\) is treated as the "data" generated by the diffusion model, conditioned on the observations \(\mathbf{O}_t\): the policy samples from \(p_\theta(\mathbf{A}_t^0 \mid \mathbf{O}_t)\) via iterative denoising.
Training:
\[ \mathcal{L} = \mathbb{E}_{\mathbf{A}_t^0, \epsilon, k}\left[ \left\| \epsilon - \epsilon_\theta\big( \sqrt{\bar{\alpha}_k}\, \mathbf{A}_t^0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon,\; \mathbf{O}_t,\; k \big) \right\|^2 \right] \]
where \(k\) is the diffusion time step (distinct from the environment time step \(t\)) and \(\mathbf{A}_t^0\) is the expert action sequence.
Inference:
- Sample \(\mathbf{A}_t^K \sim \mathcal{N}(0, I)\)
- For \(k = K, K-1, \ldots, 1\): denoise one step, \(\mathbf{A}_t^{k-1} = \text{Denoise}\big(\mathbf{A}_t^k, \epsilon_\theta(\mathbf{A}_t^k, \mathbf{O}_t, k)\big)\), using the DDPM or DDIM update
- Execute the first \(T_e\) steps of \(\mathbf{A}_t^0\) (\(T_e \leq T_a\))
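The surrounding receding-horizon loop (predict \(T_a\) steps, execute \(T_e\), re-plan) can be sketched as follows; `predict_actions` is a placeholder for the full K-step denoising procedure:

```python
import numpy as np

T_a, T_e = 16, 8          # prediction and execution horizons
ACTION_DIM = 7            # assumed action dimensionality (illustrative)

def predict_actions(obs_history):
    """Placeholder for K-step denoising of A_t conditioned on O_t."""
    return np.zeros((T_a, ACTION_DIM))

executed = []
obs_history = [np.zeros(10), np.zeros(10)]   # T_o = 2 dummy observations

for step in range(0, 40, T_e):               # run a 40-step episode
    plan = predict_actions(obs_history)      # shape (T_a, ACTION_DIM)
    for a in plan[:T_e]:                     # execute only the first T_e steps
        executed.append(a)                   # (in practice: send to the robot,
                                             #  then append the new observation)
len(executed)  # 40 actions executed over 5 re-planning cycles
```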
Action Space Parameters
| Parameter | Typical Value | Description |
|---|---|---|
| Observation history \(T_o\) | 2 | Current frame + 1 frame of history |
| Prediction horizon \(T_a\) | 16 | Predict 16 future action steps |
| Execution horizon \(T_e\) | 8 | Execute 8 steps then re-predict |
| Denoising steps \(K\) | 100 (training) / 10–16 (inference with DDIM) | Inference acceleration |
Network Architecture
Observation Encoder
Visual encoder: a pretrained ResNet-18 extracts image features \(h_{\text{img}}\).
State encoder: a linear projection embeds the robot proprioceptive state into \(h_{\text{state}}\).
Fusion: the visual and state features are concatenated and projected to form the conditioning vector \(h_t = W\,[h_{\text{img}}; h_{\text{state}}] + b\).
Noise Prediction Network
Diffusion Policy offers two architecture choices:
1. CNN-based (1D temporal convolution):
Treats the action sequence as a 1D temporal signal and denoises it with a ResNet-style 1D convolutional network. Observations are injected via FiLM conditioning: each feature map \(h\) is modulated per channel as \(\text{FiLM}(h) = \gamma(\mathbf{O}_t) \odot h + \beta(\mathbf{O}_t)\), where \(\gamma\) and \(\beta\) are learned from the observation embedding.
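FiLM conditioning is a per-channel affine modulation of the convolutional features. A minimal sketch (dimensions and the random projection weights are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

COND_DIM, CHANNELS, HORIZON = 64, 32, 16

# Learned projection from the observation embedding to per-channel
# scale (gamma) and shift (beta); random stand-in weights here.
W = rng.standard_normal((2 * CHANNELS, COND_DIM)) * 0.01

def film(features, cond):
    """FiLM: gamma(cond) * features + beta(cond), applied per channel."""
    gamma_beta = W @ cond                    # (2*CHANNELS,)
    gamma, beta = np.split(gamma_beta, 2)
    return gamma[:, None] * features + beta[:, None]

h = rng.standard_normal((CHANNELS, HORIZON))   # 1D conv feature map
cond = rng.standard_normal(COND_DIM)           # observation embedding
out = film(h, cond)                            # same shape as h
```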
2. Transformer-based:
Uses a Transformer decoder in which the noisy action tokens cross-attend to the observation encodings (used as keys/values). The diffusion time step \(k\) is injected via a sinusoidal positional embedding.
Architecture Comparison
| Dimension | CNN-based | Transformer-based |
|---|---|---|
| Inference speed | Faster | Slower |
| Parameter count | ~25M | ~40M |
| Long sequence modeling | Limited by receptive field | Global attention |
| Tuning difficulty | Easier | Harder |
| Typical tasks | General | Long-horizon tasks |
Why Diffusion Policy Outperforms Traditional BC
Multimodal Distribution Modeling
Traditional BC (MSE loss + deterministic network) can only output a single action. When the data contains multiple reasonable actions, MSE regression predicts the mean — which is typically not any of the reasonable actions.
Example: There is a cup on the table; one can go around it from the left or right side.
- Traditional BC: Predicts the middle path (may collide with the cup)
- Gaussian Mixture Model: Requires prespecifying the number of modes
- Diffusion Policy: Naturally models arbitrarily complex multimodal distributions
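The mode-averaging failure is easy to see numerically: with a bimodal demonstration set, the MSE-optimal constant prediction is the mean, which lies in neither mode. A toy illustration (steering values of -1 for "go left" and +1 for "go right" are assumed labels):

```python
import numpy as np

rng = np.random.default_rng(0)

# Demonstrations: half the experts go left (-1), half go right (+1).
actions = np.concatenate([
    -1.0 + 0.05 * rng.standard_normal(500),   # "left" mode
    +1.0 + 0.05 * rng.standard_normal(500),   # "right" mode
])

# MSE regression converges to the conditional mean...
mse_optimal = actions.mean()                  # ~0: drives straight into the cup

# ...while sampling from the demonstrated distribution returns a real mode.
sample = rng.choice(actions)                  # near -1 or near +1
```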
Contact-Rich Tasks
In tasks involving contact (e.g., insertion, folding, pouring), tiny position differences lead to vastly different mechanical behavior. The iterative denoising process of diffusion policy can precisely generate action sequences that satisfy contact constraints.
Experimental Comparison (Chi et al., 2023)
| Task | BC (MSE) | BC (GMM) | IBC | Diffusion Policy |
|---|---|---|---|---|
| Push-T | 56.5% | 65.1% | 62.3% | 86.2% |
| Can | 78.5% | 82.1% | 80.4% | 94.6% |
| Square | 45.2% | 51.8% | 48.9% | 84.7% |
| Transport | 35.1% | 42.3% | 38.7% | 73.8% |
3D Diffusion Policy (DP3)
Motivation
Standard Diffusion Policy takes RGB images as input, but images lack 3D geometric information. DP3 (Ze et al., 2024) uses point clouds as input, providing explicit 3D spatial understanding.
Architecture
Point cloud encoding: a PointNet++ or similar network encodes the point cloud \(P \in \mathbb{R}^{N \times 3}\) into a compact representation \(h_{\text{3D}}\).
Conditional diffusion: Same as standard Diffusion Policy, but using \(h_{\text{3D}}\) instead of \(h_{\text{img}}\) as the condition.
Advantages
- Viewpoint invariance: 3D representation is naturally robust to camera viewpoint changes
- Spatial reasoning: Explicitly encodes spatial relationships between objects
- Depth information: Precise distance information is crucial for manipulation tasks
Experimental Results
DP3 surpasses RGB-based Diffusion Policy on multiple benchmarks, especially on tasks requiring precise spatial reasoning (e.g., assembly, alignment).
Consistency Policy: Fast Inference
Problem
Standard Diffusion Policy requires 10–100 denoising iterations per inference, each a full forward pass of the noise-prediction network. For real-time control this is costly:
- 16-step DDIM: ~50ms per inference (barely meets 20Hz)
- 100-step DDPM: ~300ms (far from meeting real-time requirements)
Consistency Model Review
Consistency Model (Song et al., 2023) learns a function \(f_\theta\) that maps any point on a diffusion trajectory directly to its starting point:
\[ f_\theta(x_t, t) = x_0, \quad \forall t \in (0, T] \]
Self-consistency condition: for any two points \(x_t\) and \(x_{t'}\) on the same trajectory, \(f_\theta(x_t, t) = f_\theta(x_{t'}, t')\).
Consistency Policy
Applying the Consistency Model to robot policies (Prasad et al., 2024):
Training objective (distillation approach):
\[ \mathcal{L} = \mathbb{E}\left[ d\!\left( f_\theta\big(\mathbf{A}^{t_{k+1}}, t_{k+1}\big),\; f_{\theta^-}\big(\hat{\mathbf{A}}^{t_k}, t_k\big) \right) \right] \]
where \(\hat{\mathbf{A}}^{t_k}\) is obtained from \(\mathbf{A}^{t_{k+1}}\) by one denoising step of the pretrained diffusion model, \(d(\cdot,\cdot)\) is a distance (e.g., \(\ell_2\)), and \(\theta^-\) is the EMA target network.
Inference: only 1–2 network evaluations are needed to generate high-quality action sequences: \(\mathbf{A}^0 = f_\theta(\mathbf{A}^K, K)\) in one step, optionally refined by re-noising the estimate to an intermediate step and applying \(f_\theta\) once more.
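One- and two-step consistency sampling can be sketched as follows; `f_theta` is a placeholder for the trained consistency network, and the re-noising in the two-step variant follows the forward process (schedule values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_SHAPE = (16, 7)       # (T_a, action_dim), assumed shapes
K = 1000                     # terminal diffusion step
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, K))

def f_theta(a_noisy, k, obs):
    """Placeholder consistency function: maps any (A^k, k) to a clean A^0."""
    return np.zeros(ACTION_SHAPE)

def consistency_sample(obs, steps=2, k_mid=500):
    a = rng.standard_normal(ACTION_SHAPE)        # A^K ~ N(0, I)
    a0 = f_theta(a, K - 1, obs)                  # one-step estimate
    if steps == 2:                               # optional refinement:
        eps = rng.standard_normal(ACTION_SHAPE)  # re-noise to intermediate k
        a_mid = np.sqrt(alpha_bar[k_mid]) * a0 + np.sqrt(1 - alpha_bar[k_mid]) * eps
        a0 = f_theta(a_mid, k_mid, obs)          # map back to clean actions
    return a0

actions = consistency_sample(obs=None, steps=2)  # shape (16, 7)
```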
Speed Comparison
| Method | Denoising Steps | Inference Time | Control Frequency |
|---|---|---|---|
| DDPM | 100 | ~300 ms | ~3 Hz |
| DDIM | 16 | ~50 ms | ~20 Hz |
| Consistency Policy | 1–2 | ~5 ms | ~200 Hz |
Consistency Policy achieves 10–60x inference speedup, enabling diffusion policies for high-frequency control scenarios.
Implementation Tips
Action Normalization
Diffusion models work best when the data is scaled to the \([-1, 1]\) range, so actions are min-max normalized per dimension: \( a_{\text{norm}} = 2\,\frac{a - a_{\min}}{a_{\max} - a_{\min}} - 1 \).
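A minimal min-max normalizer with its inverse, computing per-dimension statistics from the demonstration data (function names are illustrative):

```python
import numpy as np

def fit_normalizer(actions):
    """Per-dimension min/max over the demonstration actions (N, action_dim)."""
    return actions.min(axis=0), actions.max(axis=0)

def normalize(a, a_min, a_max):
    """Map actions into [-1, 1] per dimension."""
    return 2.0 * (a - a_min) / (a_max - a_min) - 1.0

def denormalize(a_norm, a_min, a_max):
    """Inverse map from [-1, 1] back to the original action range."""
    return 0.5 * (a_norm + 1.0) * (a_max - a_min) + a_min

rng = np.random.default_rng(0)
demos = rng.uniform(-0.3, 0.8, size=(1000, 7))   # toy demonstration actions
a_min, a_max = fit_normalizer(demos)
z = normalize(demos, a_min, a_max)               # all values in [-1, 1]
back = denormalize(z, a_min, a_max)              # round-trips to the originals
```

The same `a_min`/`a_max` statistics must be saved with the model so that predicted actions can be denormalized at deployment time.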
Observation and Action Windows
Important hyperparameters in practice:
- Observation window: Use the most recent \(T_o = 2\) observations
- Prediction window: Predict \(T_a = 16\) future action steps
- Execution window: Execute \(T_e = 8\) steps then re-plan (receding horizon)
Overlapping execution provides smooth action transitions.
Training Tricks
- EMA model: Use an exponential moving average model for inference
- Learning rate schedule: Cosine annealing
- Data augmentation: Random cropping, color jittering
- Gradient clipping: Prevents training instability
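The EMA trick from the list above keeps a slowly moving copy of the weights for inference. A minimal sketch (the decay of 0.999 is a typical choice, not prescribed by the text):

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """Exponential moving average: ema <- decay * ema + (1 - decay) * params."""
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]
    return ema_params

# Toy "model": a dict of weight arrays.
params = {"w": np.ones(4)}
ema = {"w": np.zeros(4)}

for _ in range(100):                 # after many updates toward w = 1...
    ema = ema_update(ema, params)

# ...the EMA weights approach, but lag, the live weights.
ema["w"]  # each entry equals 1 - 0.999**100 ≈ 0.095
```

At inference time the EMA weights, not the live training weights, are loaded into the policy; the averaging smooths out the noise of individual gradient steps.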
Connections to Other Chapters
- Imitation learning: Imitation Learning — BC is the baseline comparison method for diffusion policy
- VLA models: VLA Models — some models also employ diffusion decoders
- Teleoperation: Teleoperation and Data Collection provides the high-quality demonstration data needed by diffusion policy
References
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS.
- Song, J., Meng, C., & Ermon, S. (2021). Denoising Diffusion Implicit Models. ICLR.
- Chi, C., et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS.
- Ze, Y., et al. (2024). 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations. RSS.
- Song, Y., et al. (2023). Consistency Models. ICML.
- Prasad, V., et al. (2024). Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation. RSS.