Diffusion Policy
Overview
Diffusion Policy (Chi et al., 2023) introduces Denoising Diffusion Probabilistic Models (DDPMs) into robot behavioral cloning, addressing the multimodal action-distribution problem that traditional BC struggles with. On contact-rich manipulation tasks, diffusion policies substantially outperform traditional behavioral-cloning baselines.
Background: Denoising Diffusion Probabilistic Models
DDPM Review
DDPM (Ho et al., 2020) defines two processes:
Forward process (adding noise): Gradually adds Gaussian noise to data \(x_0\):
\[ q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big) \]
where \(\beta_t\) is the noise schedule. Using the cumulative product \(\bar{\alpha}_t = \prod_{s=1}^t (1 - \beta_s)\), one can sample the noisy data at any time step \(t\) directly from \(x_0\):
\[ q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\big) \]
That is, \( x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \) with \( \epsilon \sim \mathcal{N}(0, I) \).
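The closed-form forward sampling can be checked numerically. A minimal NumPy sketch (the linear beta schedule here is an illustrative choice, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear noise schedule (an assumption, not from the paper).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)           # \bar{\alpha}_t

# Sample x_t at an arbitrary step t directly from x_0.
x0 = 1.0                                      # toy scalar "data"
t = 500
eps = rng.standard_normal(100_000)
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Empirical statistics should match N(sqrt(abar_t) * x0, 1 - abar_t).
print(xt.mean(), np.sqrt(alpha_bar[t]) * x0)  # means agree
print(xt.std(), np.sqrt(1.0 - alpha_bar[t]))  # stds agree
```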
Reverse process (denoising): Learns to gradually denoise from noise \(x_T \sim \mathcal{N}(0, I)\) back to data:
\[ p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big) \]
DDPM Training Objective
Through the simplified variational lower bound, the DDPM training objective reduces to noise prediction:
\[ \mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0, \epsilon, t}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right] \]
where \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\), and \(\epsilon_\theta\) is the noise prediction network.
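In code, the simplified objective is just an MSE between the true and predicted noise. A minimal sketch with a stand-in predictor (`eps_pred` is a placeholder for a real \(\epsilon_\theta\) network):

```python
import numpy as np

rng = np.random.default_rng(0)

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

def noisy_sample(x0, t, eps):
    """Forward-process sample x_t = sqrt(abar_t) x0 + sqrt(1-abar_t) eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def ddpm_loss(eps, eps_pred):
    """Simplified DDPM objective: MSE between true and predicted noise."""
    return np.mean((eps - eps_pred) ** 2)

# One training step on toy data with a placeholder "network".
x0 = rng.standard_normal(16)              # toy data vector
t = int(rng.integers(0, len(betas)))      # random diffusion step
eps = rng.standard_normal(16)             # the noise the network must predict
xt = noisy_sample(x0, t, eps)

eps_pred = np.zeros_like(xt)              # stand-in for eps_theta(xt, t)
loss = ddpm_loss(eps, eps_pred)           # nonzero for this trivial predictor
```

A perfect predictor (one that returns `eps` exactly) drives this loss to zero, which is why noise prediction is a valid training target.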
Derivation:
The reverse-process posterior of DDPM is \(q(x_{t-1} \mid x_t, x_0)\), with mean (writing \(\alpha_t = 1 - \beta_t\)):
\[ \tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t \]
Substituting \(x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon\right)\) gives the mean parameterized by \(\epsilon_\theta\):
\[ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) \]
Minimizing \(\|\tilde{\mu}_t - \mu_\theta\|^2\) is therefore equivalent, up to time-dependent coefficients, to minimizing \(\|\epsilon - \epsilon_\theta\|^2\).
DDIM Accelerated Sampling
DDPM requires \(T\) denoising steps (typically 1000), which is too slow for control. DDIM (Song et al., 2021) defines a non-Markovian sampling process that can use \(S \ll T\) steps:
\[ x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\; \epsilon_\theta(x_t, t) + \sigma_t \epsilon_t \]
where the first term is the predicted \(x_0\). When \(\sigma_t = 0\) the sampler is deterministic (DDIM); choosing \(\sigma_t\) to match the DDPM posterior variance recovers stochastic DDPM sampling.
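A single DDIM update can be implemented directly from the update rule. A sketch (schedule and shapes illustrative) that returns both the predicted \(x_0\) and the previous-step sample:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_step(xt, eps_hat, t, t_prev, sigma=0.0, noise=0.0):
    """One DDIM update from step t to t_prev (sigma=0 -> deterministic)."""
    # Predicted clean sample from the noise estimate.
    x0_pred = (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    # Direction pointing back toward x_t, plus optional fresh noise.
    dir_xt = np.sqrt(1.0 - alpha_bar[t_prev] - sigma**2) * eps_hat
    x_prev = np.sqrt(alpha_bar[t_prev]) * x0_pred + dir_xt + sigma * noise
    return x_prev, x0_pred

# Sanity check: if eps_hat equals the true forward-process noise,
# the clean sample x0 is recovered exactly.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
t, t_prev = 500, 450
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
x_prev, x0_pred = ddim_step(xt, eps, t, t_prev)
```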
Diffusion Policy Formulation
Problem Setup
Given the recent observations \(\mathbf{O}_t\) (images + robot state), predict the next \(T_a\) action steps:
\[ \mathbf{A}_t = (a_t, a_{t+1}, \ldots, a_{t+T_a-1}) \]
where \(\mathbf{O}_t = (o_{t-T_o+1}, \ldots, o_t)\) is the observation history over the last \(T_o\) steps.
Conditional Denoising
The action sequence \(\mathbf{A}_t\) is treated as the "data" generated by the diffusion model, conditioned on the observations \(\mathbf{O}_t\): the policy samples from \(p_\theta(\mathbf{A}_t^0 \mid \mathbf{O}_t)\) via iterative denoising.
Training:
\[ \mathcal{L} = \mathbb{E}_{\mathbf{A}_t^0, \epsilon, k}\left[ \left\| \epsilon - \epsilon_\theta\big( \sqrt{\bar{\alpha}_k}\, \mathbf{A}_t^0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon,\; \mathbf{O}_t,\; k \big) \right\|^2 \right] \]
where \(k\) is the diffusion time step (distinct from the environment time step \(t\)) and \(\mathbf{A}_t^0\) is the expert action sequence.
Inference:
- Sample \(\mathbf{A}_t^K \sim \mathcal{N}(0, I)\)
- For \(k = K, K-1, \ldots, 1\): denoise one step, \(\mathbf{A}_t^{k-1} = \text{Denoise}\big(\mathbf{A}_t^k, \epsilon_\theta(\mathbf{A}_t^k, \mathbf{O}_t, k)\big)\), using the DDPM or DDIM update
- Execute the first \(T_e\) steps of \(\mathbf{A}_t^0\) (\(T_e \leq T_a\))
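The surrounding receding-horizon loop (predict \(T_a\) steps, execute \(T_e\), re-plan) can be sketched as follows; `predict_actions` is a placeholder for the full K-step denoising procedure:

```python
import numpy as np

T_a, T_e = 16, 8          # prediction and execution horizons
ACTION_DIM = 7            # assumed action dimensionality (illustrative)

def predict_actions(obs_history):
    """Placeholder for K-step denoising of A_t conditioned on O_t."""
    return np.zeros((T_a, ACTION_DIM))

executed = []
obs_history = [np.zeros(10), np.zeros(10)]   # T_o = 2 dummy observations

for step in range(0, 40, T_e):               # run a 40-step episode
    plan = predict_actions(obs_history)      # shape (T_a, ACTION_DIM)
    for a in plan[:T_e]:                     # execute only the first T_e steps
        executed.append(a)                   # (in practice: send to the robot,
                                             #  then append the new observation)
len(executed)  # 40 actions executed over 5 re-planning cycles
```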
Action Space Parameters
| Parameter | Typical Value | Description |
|---|---|---|
| Observation history \(T_o\) | 2 | Current frame + 1 frame of history |
| Prediction horizon \(T_a\) | 16 | Predict 16 future action steps |
| Execution horizon \(T_e\) | 8 | Execute 8 steps then re-predict |
| Denoising steps \(K\) | 100 (training) / 10–16 (inference with DDIM) | Inference acceleration |
Network Architecture
Observation Encoder
Visual encoder: a pretrained ResNet-18 extracts image features \(h_{\text{img}}\).
State encoder: a linear projection embeds the robot proprioceptive state into \(h_{\text{state}}\).
Fusion: the visual and state features are concatenated and projected to form the conditioning vector \(h_t = W\,[h_{\text{img}}; h_{\text{state}}] + b\).
Noise Prediction Network
Diffusion Policy offers two architecture choices:
1. CNN-based (1D temporal convolution):
Treats the action sequence as a 1D temporal signal and denoises it with a ResNet-style 1D convolutional network. Observations are injected via FiLM conditioning: each feature map \(h\) is modulated per channel as \(\text{FiLM}(h) = \gamma(\mathbf{O}_t) \odot h + \beta(\mathbf{O}_t)\), where \(\gamma\) and \(\beta\) are learned from the observation embedding.
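FiLM conditioning is a per-channel affine modulation of the convolutional features. A minimal sketch (dimensions and the random projection weights are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

COND_DIM, CHANNELS, HORIZON = 64, 32, 16

# Learned projection from the observation embedding to per-channel
# scale (gamma) and shift (beta); random stand-in weights here.
W = rng.standard_normal((2 * CHANNELS, COND_DIM)) * 0.01

def film(features, cond):
    """FiLM: gamma(cond) * features + beta(cond), applied per channel."""
    gamma_beta = W @ cond                    # (2*CHANNELS,)
    gamma, beta = np.split(gamma_beta, 2)
    return gamma[:, None] * features + beta[:, None]

h = rng.standard_normal((CHANNELS, HORIZON))   # 1D conv feature map
cond = rng.standard_normal(COND_DIM)           # observation embedding
out = film(h, cond)                            # same shape as h
```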
2. Transformer-based:
Uses a Transformer decoder in which the noisy action tokens cross-attend to the observation encodings (used as keys/values). The diffusion time step \(k\) is injected via a sinusoidal positional embedding.
Architecture Comparison
| Dimension | CNN-based | Transformer-based |
|---|---|---|
| Inference speed | Faster | Slower |
| Parameter count | ~25M | ~40M |
| Long sequence modeling | Limited by receptive field | Global attention |
| Tuning difficulty | Easier | Harder |
| Typical tasks | General | Long-horizon tasks |
Why Diffusion Policy Outperforms Traditional BC
Multimodal Distribution Modeling
Traditional BC (MSE loss + deterministic network) can only output a single action. When the data contains multiple reasonable actions, MSE regression predicts the mean — which is typically not any of the reasonable actions.
Example: There is a cup on the table; one can go around it from the left or right side.
- Traditional BC: Predicts the middle path (may collide with the cup)
- Gaussian Mixture Model: Requires prespecifying the number of modes
- Diffusion Policy: Naturally models arbitrarily complex multimodal distributions
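The mode-averaging failure is easy to see numerically: with a bimodal demonstration set, the MSE-optimal constant prediction is the mean, which lies in neither mode. A toy illustration (steering values of -1 for "go left" and +1 for "go right" are assumed labels):

```python
import numpy as np

rng = np.random.default_rng(0)

# Demonstrations: half the experts go left (-1), half go right (+1).
actions = np.concatenate([
    -1.0 + 0.05 * rng.standard_normal(500),   # "left" mode
    +1.0 + 0.05 * rng.standard_normal(500),   # "right" mode
])

# MSE regression converges to the conditional mean...
mse_optimal = actions.mean()                  # ~0: drives straight into the cup

# ...while sampling from the demonstrated distribution returns a real mode.
sample = rng.choice(actions)                  # near -1 or near +1
```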
Contact-Rich Tasks
In tasks involving contact (e.g., insertion, folding, pouring), tiny position differences lead to vastly different mechanical behavior. The iterative denoising process of diffusion policy can precisely generate action sequences that satisfy contact constraints.
Experimental Comparison (Chi et al., 2023)
| Task | BC (MSE) | BC (GMM) | IBC | Diffusion Policy |
|---|---|---|---|---|
| Push-T | 56.5% | 65.1% | 62.3% | 86.2% |
| Can | 78.5% | 82.1% | 80.4% | 94.6% |
| Square | 45.2% | 51.8% | 48.9% | 84.7% |
| Transport | 35.1% | 42.3% | 38.7% | 73.8% |
3D Diffusion Policy (DP3)
Motivation
Standard Diffusion Policy takes RGB images as input, but images lack 3D geometric information. DP3 (Ze et al., 2024) uses point clouds as input, providing explicit 3D spatial understanding.
Architecture
Point cloud encoding: a PointNet++ or similar network encodes the point cloud \(P \in \mathbb{R}^{N \times 3}\) into a compact representation \(h_{\text{3D}}\).
Conditional diffusion: Same as standard Diffusion Policy, but using \(h_{\text{3D}}\) instead of \(h_{\text{img}}\) as the condition.
Advantages
- Viewpoint invariance: 3D representation is naturally robust to camera viewpoint changes
- Spatial reasoning: Explicitly encodes spatial relationships between objects
- Depth information: Precise distance information is crucial for manipulation tasks
Experimental Results
DP3 surpasses RGB-based Diffusion Policy on multiple benchmarks, especially on tasks requiring precise spatial reasoning (e.g., assembly, alignment).
Consistency Policy: Fast Inference
Problem
Standard Diffusion Policy requires 10–100 denoising iterations per inference, each a full forward pass of the noise-prediction network. For real-time control this is costly:
- 16-step DDIM: ~50ms per inference (barely meets 20Hz)
- 100-step DDPM: ~300ms (far from meeting real-time requirements)
Consistency Model Review
Consistency Model (Song et al., 2023) learns a function \(f_\theta\) that maps any point on a diffusion trajectory directly to its starting point:
\[ f_\theta(x_t, t) = x_0, \quad \forall t \in (0, T] \]
Self-consistency condition: for any two points \(x_t\) and \(x_{t'}\) on the same trajectory, \(f_\theta(x_t, t) = f_\theta(x_{t'}, t')\).
Consistency Policy
Applying the Consistency Model to robot policies (Prasad et al., 2024):
Training objective (distillation approach):
\[ \mathcal{L} = \mathbb{E}\left[ d\!\left( f_\theta\big(\mathbf{A}^{t_{k+1}}, t_{k+1}\big),\; f_{\theta^-}\big(\hat{\mathbf{A}}^{t_k}, t_k\big) \right) \right] \]
where \(\hat{\mathbf{A}}^{t_k}\) is obtained from \(\mathbf{A}^{t_{k+1}}\) by one denoising step of the pretrained diffusion model, \(d(\cdot,\cdot)\) is a distance (e.g., \(\ell_2\)), and \(\theta^-\) is the EMA target network.
Inference: only 1–2 network evaluations are needed to generate high-quality action sequences: \(\mathbf{A}^0 = f_\theta(\mathbf{A}^K, K)\) in one step, optionally refined by re-noising the estimate to an intermediate step and applying \(f_\theta\) once more.
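One- and two-step consistency sampling can be sketched as follows; `f_theta` is a placeholder for the trained consistency network, and the re-noising in the two-step variant follows the forward process (schedule values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_SHAPE = (16, 7)       # (T_a, action_dim), assumed shapes
K = 1000                     # terminal diffusion step
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, K))

def f_theta(a_noisy, k, obs):
    """Placeholder consistency function: maps any (A^k, k) to a clean A^0."""
    return np.zeros(ACTION_SHAPE)

def consistency_sample(obs, steps=2, k_mid=500):
    a = rng.standard_normal(ACTION_SHAPE)        # A^K ~ N(0, I)
    a0 = f_theta(a, K - 1, obs)                  # one-step estimate
    if steps == 2:                               # optional refinement:
        eps = rng.standard_normal(ACTION_SHAPE)  # re-noise to intermediate k
        a_mid = np.sqrt(alpha_bar[k_mid]) * a0 + np.sqrt(1 - alpha_bar[k_mid]) * eps
        a0 = f_theta(a_mid, k_mid, obs)          # map back to clean actions
    return a0

actions = consistency_sample(obs=None, steps=2)  # shape (16, 7)
```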
Speed Comparison
| Method | Denoising Steps | Inference Time | Control Frequency |
|---|---|---|---|
| DDPM | 100 | ~300 ms | ~3 Hz |
| DDIM | 16 | ~50 ms | ~20 Hz |
| Consistency Policy | 1–2 | ~5 ms | ~200 Hz |
Consistency Policy achieves 10–60x inference speedup, enabling diffusion policies for high-frequency control scenarios.
Implementation Tips
Action Normalization
Diffusion models work best when the data is scaled to the \([-1, 1]\) range, so actions are min-max normalized per dimension: \( a_{\text{norm}} = 2\,\frac{a - a_{\min}}{a_{\max} - a_{\min}} - 1 \).
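A minimal min-max normalizer with its inverse, computing per-dimension statistics from the demonstration data (function names are illustrative):

```python
import numpy as np

def fit_normalizer(actions):
    """Per-dimension min/max over the demonstration actions (N, action_dim)."""
    return actions.min(axis=0), actions.max(axis=0)

def normalize(a, a_min, a_max):
    """Map actions into [-1, 1] per dimension."""
    return 2.0 * (a - a_min) / (a_max - a_min) - 1.0

def denormalize(a_norm, a_min, a_max):
    """Inverse map from [-1, 1] back to the original action range."""
    return 0.5 * (a_norm + 1.0) * (a_max - a_min) + a_min

rng = np.random.default_rng(0)
demos = rng.uniform(-0.3, 0.8, size=(1000, 7))   # toy demonstration actions
a_min, a_max = fit_normalizer(demos)
z = normalize(demos, a_min, a_max)               # all values in [-1, 1]
back = denormalize(z, a_min, a_max)              # round-trips to the originals
```

The same `a_min`/`a_max` statistics must be saved with the model so that predicted actions can be denormalized at deployment time.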
Observation and Action Windows
Important hyperparameters in practice:
- Observation window: Use the most recent \(T_o = 2\) observations
- Prediction window: Predict \(T_a = 16\) future action steps
- Execution window: Execute \(T_e = 8\) steps then re-plan (receding horizon)
Overlapping execution provides smooth action transitions.
Training Tricks
- EMA model: Use an exponential moving average model for inference
- Learning rate schedule: Cosine annealing
- Data augmentation: Random cropping, color jittering
- Gradient clipping: Prevents training instability
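The EMA trick from the list above keeps a slowly moving copy of the weights for inference. A minimal sketch (the decay of 0.999 is a typical choice, not prescribed by the text):

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """Exponential moving average: ema <- decay * ema + (1 - decay) * params."""
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]
    return ema_params

# Toy "model": a dict of weight arrays.
params = {"w": np.ones(4)}
ema = {"w": np.zeros(4)}

for _ in range(100):                 # after many updates toward w = 1...
    ema = ema_update(ema, params)

# ...the EMA weights approach, but lag, the live weights.
ema["w"]  # each entry equals 1 - 0.999**100 ≈ 0.095
```

At inference time the EMA weights, not the live training weights, are loaded into the policy; the averaging smooths out the noise of individual gradient steps.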
Connections to Other Chapters
- Imitation learning: Imitation Learning — BC is the baseline comparison method for diffusion policy
- VLA models: VLA Models — some models also employ diffusion decoders
- Teleoperation: Teleoperation and Data Collection provides the high-quality demonstration data needed by diffusion policy
References
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS.
- Song, J., Meng, C., & Ermon, S. (2021). Denoising Diffusion Implicit Models. ICLR.
- Chi, C., et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS.
- Ze, Y., et al. (2024). 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations. RSS.
- Song, Y., et al. (2023). Consistency Models. ICML.
- Prasad, V., et al. (2024). Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation. RSS.