
Diffusion Policy

Overview

Diffusion Policy (Chi et al., 2023) brings Denoising Diffusion Probabilistic Models (DDPMs) into robot behavioral cloning (BC), addressing the multimodal action-distribution problem that traditional BC struggles with. On contact-rich manipulation tasks, diffusion policies substantially outperform earlier BC methods (see the experimental comparison below).


Background: Denoising Diffusion Probabilistic Models

DDPM Review

DDPM (Ho et al., 2020) defines two processes:

Forward process (adding noise): Gradually adds Gaussian noise to data \(x_0\):

\[ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I) \]

where \(\beta_t\) is the noise schedule. Using the cumulative parameter \(\bar{\alpha}_t = \prod_{s=1}^t (1 - \beta_s)\), one can directly sample noisy data at any time step \(t\) from \(x_0\):

\[ q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I) \]

That is:

\[ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]
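As a minimal sketch of this closed-form sampling (assuming a precomputed tensor `alpha_bar` holding \(\bar{\alpha}_1, \ldots, \bar{\alpha}_T\); all code on this page is illustrative PyTorch, not a reference implementation):

```python
import torch

def q_sample(x0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) in one shot.

    x0:        clean data, shape [B, ...]
    t:         integer timesteps, shape [B]
    alpha_bar: cumulative alpha-bar schedule, shape [T]
    """
    noise = torch.randn_like(x0)
    # Broadcast the per-sample coefficients over the data dimensions.
    shape = (-1,) + (1,) * (x0.dim() - 1)
    sqrt_ab = alpha_bar[t].sqrt().view(shape)
    sqrt_one_minus_ab = (1.0 - alpha_bar[t]).sqrt().view(shape)
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise, noise
```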

Reverse process (denoising): Learns to gradually denoise from noise \(x_T \sim \mathcal{N}(0, I)\) back to data:

\[ p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I) \]

DDPM Training Objective

Through the simplified variational lower bound, the DDPM training objective is equivalent to noise prediction:

\[ \mathcal{L}_{\text{DDPM}} = \mathbb{E}_{t \sim U[1,T], \, x_0, \, \epsilon \sim \mathcal{N}(0,I)} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right] \]

where \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\), and \(\epsilon_\theta\) is the noise prediction network.
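Building on `q_sample` above, a hedged sketch of one training step (the `model(x_t, t)` call signature is an assumption):

```python
import torch.nn.functional as F

def ddpm_loss(model, x0: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    """One DDPM training step: corrupt x0, then regress the injected noise."""
    B, T = x0.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)  # t ~ uniform over timesteps
    x_t, noise = q_sample(x0, t, alpha_bar)
    eps_pred = model(x_t, t)                         # epsilon_theta(x_t, t)
    return F.mse_loss(eps_pred, noise)
```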

Derivation:

The reverse process posterior of DDPM is \(q(x_{t-1}|x_t, x_0)\), with mean:

\[ \tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t \]

Substituting \(x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(x_t - \sqrt{1-\bar{\alpha}_t}\epsilon)\) gives the mean parameterized by \(\epsilon_\theta\):

\[ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) \]

Minimizing \(\|\tilde{\mu}_t - \mu_\theta\|^2\) is equivalent to minimizing \(\|\epsilon - \epsilon_\theta\|^2\) (up to coefficients).

DDIM Accelerated Sampling

DDPM requires \(T\) denoising steps (typically 1000), which is too slow for real-time use. DDIM (Song et al., 2021) provides a non-Markovian sampling process that can use \(S \ll T\) steps:

\[ x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } x_0} + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \cdot \epsilon_\theta(x_t, t) + \sigma_t \epsilon \]

When \(\sigma_t = 0\), sampling is deterministic (DDIM); when \(\sigma_t = \sqrt{\tilde{\beta}_t}\) with \(\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t\) (the forward-process posterior variance), it recovers DDPM.
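A deterministic DDIM sampler sketch (\(\sigma_t = 0\)), stepping over a coarse subsequence of timesteps:

```python
@torch.no_grad()
def ddim_sample(model, shape, alpha_bar, steps: int = 16, device: str = "cpu"):
    """Denoise pure Gaussian noise in `steps` << T deterministic DDIM steps."""
    T = alpha_bar.shape[0]
    ts = torch.linspace(T - 1, 0, steps, device=device).long()
    x = torch.randn(shape, device=device)
    for i, t in enumerate(ts):
        ab_t = alpha_bar[t]
        # alpha_bar at the next (earlier) step; 1.0 corresponds to clean data.
        ab_prev = alpha_bar[ts[i + 1]] if i + 1 < steps else torch.tensor(1.0, device=device)
        eps = model(x, t.expand(shape[0]))
        x0_pred = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()   # predicted x_0
        x = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
    return x
```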


Diffusion Policy Formulation

Problem Setup

Given observation \(\mathbf{o}_t\) (images + robot state), predict the next \(T_a\) action steps:

\[ \mathbf{A}_t = (a_t, a_{t+1}, \ldots, a_{t+T_a-1}) \sim p_\theta(\mathbf{A}_t | \mathbf{O}_t) \]

where \(\mathbf{O}_t = (o_{t-T_o+1}, \ldots, o_t)\) is the observation history (\(T_o\) steps).

Conditional Denoising

The action sequence \(\mathbf{A}_t\) is treated as the "data" to be generated by the diffusion model. The conditional diffusion model is conditioned on observation \(\mathbf{O}_t\):

Training:

\[ \mathcal{L} = \mathbb{E}_{k, \mathbf{A}_t^0, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_k} \mathbf{A}_t^0 + \sqrt{1-\bar{\alpha}_k} \epsilon, \, k, \, \mathbf{O}_t \right) \right\|^2 \right] \]

where \(k\) is the diffusion time step (distinct from the environment time step \(t\)), and \(\mathbf{A}_t^0\) is the expert action sequence.
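This is the same noise-prediction loss as before, now conditioned on observation features; a sketch reusing `q_sample` from above (the `eps_net(noisy, k, obs)` signature is an assumption):

```python
import torch.nn.functional as F

def diffusion_policy_loss(eps_net, actions, obs_feat, alpha_bar):
    """Conditional noise-prediction loss on an action chunk.

    actions:  expert chunk A_t^0, shape [B, T_a, action_dim]
    obs_feat: encoded observation history O_t, shape [B, obs_dim]
    """
    B, K = actions.shape[0], alpha_bar.shape[0]
    k = torch.randint(0, K, (B,), device=actions.device)
    noisy_actions, noise = q_sample(actions, k, alpha_bar)
    eps_pred = eps_net(noisy_actions, k, obs_feat)   # epsilon_theta(., k, O_t)
    return F.mse_loss(eps_pred, noise)
```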

Inference:

  1. Sample \(\mathbf{A}_t^K \sim \mathcal{N}(0, I)\)
  2. For \(k = K, K-1, \ldots, 1\):
\[ \mathbf{A}_t^{k-1} = \frac{1}{\sqrt{\alpha_k}} \left( \mathbf{A}_t^k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}} \epsilon_\theta(\mathbf{A}_t^k, k, \mathbf{O}_t) \right) + \sigma_k z, \quad z \sim \mathcal{N}(0, I) \text{ (with } z = 0 \text{ at } k = 1\text{)} \]
  3. Execute the first \(T_e\) steps of the resulting \(\mathbf{A}_t^0\) (\(T_e \leq T_a\)); see the sketch below
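A minimal sketch of this loop (`eps_net`, `alpha`, `beta`, `alpha_bar` as before; \(\sigma_k = \sqrt{\beta_k}\) is one common variance choice):

```python
@torch.no_grad()
def sample_action_chunk(eps_net, obs_feat, alpha, alpha_bar, beta, T_a, action_dim):
    """Reverse diffusion: denoise a Gaussian action chunk conditioned on obs_feat."""
    B, K = obs_feat.shape[0], alpha_bar.shape[0]
    A = torch.randn(B, T_a, action_dim, device=obs_feat.device)
    for k in reversed(range(K)):
        kk = torch.full((B,), k, device=A.device, dtype=torch.long)
        eps = eps_net(A, kk, obs_feat)
        mean = (A - beta[k] / (1 - alpha_bar[k]).sqrt() * eps) / alpha[k].sqrt()
        z = torch.randn_like(A) if k > 0 else torch.zeros_like(A)
        A = mean + beta[k].sqrt() * z
    return A  # execute A[:, :T_e], then re-observe and re-plan
```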

Action Space Parameters

| Parameter | Typical Value | Description |
| --- | --- | --- |
| Observation history \(T_o\) | 2 | Current frame + 1 frame of history |
| Prediction horizon \(T_a\) | 16 | Predict 16 future action steps |
| Execution horizon \(T_e\) | 8 | Execute 8 steps, then re-predict |
| Denoising steps \(K\) | 100 (training) / 10–16 (DDIM inference) | Fewer steps accelerate inference |

Network Architecture

Observation Encoder

Visual encoder: A ResNet-18 backbone extracts image features (the original paper trains it from scratch with GroupNorm and spatial-softmax modifications; using pretrained weights is a common variant):

\[ h_{\text{img}} = \text{ResNet18}(I_t) \in \mathbb{R}^{512} \]

State encoder: Linear projection of robot proprioceptive state:

\[ h_{\text{state}} = W_s \cdot [q_t, \dot{q}_t, p_t^{\text{ee}}] + b_s \]

Fusion: Concatenate visual and state features, then project:

\[ h_t = \text{MLP}([h_{\text{img}}, h_{\text{state}}]) \in \mathbb{R}^{d} \]
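A hedged sketch of this encoder (dimensions such as the 64-dim state projection and \(d = 256\), and the use of ImageNet weights, are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torchvision

class ObsEncoder(nn.Module):
    """ResNet-18 image features + projected state features, fused by an MLP."""
    def __init__(self, state_dim: int, d: int = 256):
        super().__init__()
        resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # -> [B, 512, 1, 1]
        self.state_proj = nn.Linear(state_dim, 64)
        self.fuse = nn.Sequential(nn.Linear(512 + 64, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, image: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        h_img = self.backbone(image).flatten(1)   # [B, 512]
        h_state = self.state_proj(state)          # [B, 64]
        return self.fuse(torch.cat([h_img, h_state], dim=-1))
```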

Noise Prediction Network

Diffusion Policy offers two architecture choices:

1. CNN-based (1D temporal convolution):

Treats the action sequence as a 1D signal, using a ResNet-style 1D convolutional network. Observations are injected via FiLM conditioning:

\[ \text{FiLM}(x, h) = \gamma(h) \odot x + \beta(h) \]
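A minimal FiLM block over 1D conv features (channel counts and the single linear projection are assumptions):

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: scale/shift conv features by condition h."""
    def __init__(self, cond_dim: int, channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # x: [B, C, T] temporal conv features; h: [B, cond_dim]
        gamma, beta = self.to_gamma_beta(h).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)
```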

2. Transformer-based:

Uses a Transformer Decoder, with observation encodings as the key/value for cross-attention:

\[ \hat{\epsilon} = \text{TransformerDecoder}(\mathbf{A}_t^k, k, \mathbf{O}_t) \]

where the diffusion time step \(k\) is injected via sinusoidal positional encoding.
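The standard sinusoidal embedding, as a sketch:

```python
import math
import torch

def timestep_embedding(k: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of diffusion step k ([B]) -> [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=k.device) / half)
    args = k.float().unsqueeze(-1) * freqs.unsqueeze(0)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```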

Architecture Comparison

| Dimension | CNN-based | Transformer-based |
| --- | --- | --- |
| Inference speed | Faster | Slower |
| Parameter count | ~25M | ~40M |
| Long-sequence modeling | Limited by receptive field | Global attention |
| Tuning difficulty | Easier | Harder |
| Typical tasks | General | Long-horizon tasks |

Why Diffusion Policy Outperforms Traditional BC

Multimodal Distribution Modeling

Traditional BC (MSE loss + deterministic network) can only output a single action. When the data contains multiple reasonable actions, MSE regression predicts their mean, which is typically not itself a reasonable action.

Example: There is a cup on the table; one can go around it from the left or right side.

  • Traditional BC: Predicts the middle path (may collide with the cup)
  • Gaussian Mixture Model: Requires prespecifying the number of modes
  • Diffusion Policy: Naturally models arbitrarily complex multimodal distributions

Contact-Rich Tasks

In tasks involving contact (e.g., insertion, folding, pouring), tiny position differences lead to vastly different mechanical behavior. The iterative denoising process of diffusion policy can precisely generate action sequences that satisfy contact constraints.

Experimental Comparison (Chi et al., 2023)

| Task | BC (MSE) | BC (GMM) | IBC | Diffusion Policy |
| --- | --- | --- | --- | --- |
| Push-T | 56.5% | 65.1% | 62.3% | 86.2% |
| Can | 78.5% | 82.1% | 80.4% | 94.6% |
| Square | 45.2% | 51.8% | 48.9% | 84.7% |
| Transport | 35.1% | 42.3% | 38.7% | 73.8% |

3D Diffusion Policy (DP3)

Motivation

Standard Diffusion Policy takes RGB images as input, but images lack 3D geometric information. DP3 (Ze et al., 2024) uses point clouds as input, providing explicit 3D spatial understanding.

Architecture

Point cloud encoding: A lightweight PointNet-style encoder maps the point cloud \(P \in \mathbb{R}^{N \times 3}\) to a compact representation (DP3 itself uses a simple MLP-plus-max-pooling encoder; heavier networks such as PointNet++ are an alternative):

\[ h_{\text{3D}} = \text{Enc}_{\text{3D}}(P) \in \mathbb{R}^{d} \]
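A minimal permutation-invariant sketch in the spirit of this encoder (layer widths are assumptions; DP3's exact encoder differs in detail):

```python
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """Shared per-point MLP followed by max pooling (PointNet-style)."""
    def __init__(self, d: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, d),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        feats = self.mlp(points)          # points: [B, N, 3] -> [B, N, d]
        return feats.max(dim=1).values    # order-invariant pooling -> [B, d]
```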

Conditional diffusion: Same as standard Diffusion Policy, but using \(h_{\text{3D}}\) instead of \(h_{\text{img}}\) as the condition.

Advantages

  • Viewpoint invariance: 3D representation is naturally robust to camera viewpoint changes
  • Spatial reasoning: Explicitly encodes spatial relationships between objects
  • Depth information: Precise distance information is crucial for manipulation tasks

Experimental Results

DP3 surpasses RGB-based Diffusion Policy on multiple benchmarks, especially on tasks requiring precise spatial reasoning (e.g., assembly, alignment).


Consistency Policy: Fast Inference

Problem

Standard Diffusion Policy requires 10–100 denoising iterations per inference, each a full forward pass of the noise-prediction network. For real-time control at tens of Hz:

  • 16-step DDIM: ~50 ms per inference (barely reaches 20 Hz)
  • 100-step DDPM: ~300 ms (far from real-time)

Consistency Model Review

A Consistency Model (Song et al., 2023) learns a function \(f_\theta\) that directly maps any point on a diffusion trajectory back to its starting point \(x_0\):

\[ f_\theta(x_t, t) = x_0 \quad \forall t \in [0, T] \]

Self-consistency condition: For any two points \(x_t\) and \(x_{t'}\) on the same trajectory, \(f_\theta(x_t, t) = f_\theta(x_{t'}, t')\).

Consistency Policy

Applying the Consistency Model to robot policies (Prasad et al., 2024):

Training objective (distillation approach):

\[ \mathcal{L}_{\text{CD}} = \mathbb{E}_{k, \mathbf{A}^0, \epsilon} \left[ d\left( f_\theta(\mathbf{A}^{t_{k+1}}, t_{k+1}, \mathbf{O}) , \, f_{\theta^-}(\hat{\mathbf{A}}^{t_k}, t_k, \mathbf{O}) \right) \right] \]

where \(\hat{\mathbf{A}}^{t_k}\) is the result of one-step denoising from \(\mathbf{A}^{t_{k+1}}\) using the pretrained diffusion model, and \(\theta^-\) is the EMA target network.

Inference: Only 1–2 steps are needed to generate high-quality action sequences:

\[ \mathbf{A}_t^0 = f_\theta(\mathbf{A}_t^T, T, \mathbf{O}_t), \quad \mathbf{A}_t^T \sim \mathcal{N}(0, I) \]
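A one-step inference sketch matching this notation (the `f_theta(A, t, obs)` signature is an assumption):

```python
import torch

@torch.no_grad()
def consistency_infer(f_theta, obs_feat, T_a: int, action_dim: int, T: int):
    """Single-step sampling: map pure noise straight to an action chunk."""
    B = obs_feat.shape[0]
    A_T = torch.randn(B, T_a, action_dim, device=obs_feat.device)
    t = torch.full((B,), T, device=obs_feat.device)
    return f_theta(A_T, t, obs_feat)   # A_t^0 = f_theta(A_t^T, T, O_t)
```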

Speed Comparison

| Method | Denoising Steps | Inference Time | Control Frequency |
| --- | --- | --- | --- |
| DDPM | 100 | ~300 ms | ~3 Hz |
| DDIM | 16 | ~50 ms | ~20 Hz |
| Consistency Policy | 1–2 | ~5 ms | ~200 Hz |

Consistency Policy achieves 10–60x inference speedup, enabling diffusion policies for high-frequency control scenarios.


Implementation Tips

Action Normalization

Diffusion models assume data lies in the \([-1, 1]\) range. Actions need to be normalized:

\[ a_{\text{norm}} = \frac{2(a - a_{\min})}{a_{\max} - a_{\min}} - 1 \]
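As a sketch (per-dimension `a_min`/`a_max` come from dataset statistics):

```python
def normalize(a, a_min, a_max):
    """Map raw actions to [-1, 1] per dimension."""
    return 2.0 * (a - a_min) / (a_max - a_min) - 1.0

def unnormalize(a_norm, a_min, a_max):
    """Invert the mapping before executing predicted actions."""
    return 0.5 * (a_norm + 1.0) * (a_max - a_min) + a_min
```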

Observation and Action Windows

Important hyperparameters in practice:

  • Observation window: Use the most recent \(T_o = 2\) observations
  • Prediction window: Predict \(T_a = 16\) future action steps
  • Execution window: Execute \(T_e = 8\) steps then re-plan (receding horizon)

Overlapping execution provides smooth action transitions.

Training Tricks

  1. EMA model: Use an exponential moving average model for inference
  2. Learning rate schedule: Cosine annealing
  3. Data augmentation: Random cropping, color jittering
  4. Gradient clipping: Prevents training instability


References

  1. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS.
  2. Song, J., Meng, C., & Ermon, S. (2021). Denoising Diffusion Implicit Models. ICLR.
  3. Chi, C., et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS.
  4. Ze, Y., et al. (2024). 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations. RSS.
  5. Song, Y., et al. (2023). Consistency Models. ICML.
  6. Prasad, V., et al. (2024). Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation. RSS.
