Diffusion Models

Diffusion models are a family of generative models built on the idea of iterative denoising. The core concept is remarkably elegant: first, define a forward process that gradually adds noise to data until it becomes pure Gaussian noise; then, train a neural network to learn the reverse process, recovering data step by step from noise. This "destroy and reconstruct" paradigm has achieved breakthrough results in image generation, video generation, audio synthesis, molecular design, and more, making diffusion models one of the most influential families of generative models today.


Background and Motivation

The Evolution of Generative Models

The central goal of generative models is to learn the data distribution \(p_{data}(x)\) and sample new data from it. Before the rise of diffusion models, the mainstream generative models belonged to three major families:

Model Family     | Core Idea                                                                   | Representative Work     | Main Limitations
VAE (2013)       | Learn latent-variable distributions via encoder-decoder; maximize the ELBO  | Kingma & Welling, 2013  | Quality limited by the posterior approximation; images tend to be blurry
GAN (2014)       | Adversarial game between generator and discriminator                        | Goodfellow et al., 2014 | Unstable training; mode collapse; difficulty covering all modes
Flow (2014-2018) | Invertible transformations with exact likelihood computation                | NICE, RealNVP, Glow     | Architecture must be invertible; high computational cost

Each approach has clear shortcomings. VAE-generated images tend to be blurry because VAEs optimize a lower bound on reconstruction error. GANs can produce sharp images but suffer from unstable training and are prone to ignoring parts of the data distribution (mode collapse). Flow models require invertible architectures, which limits the network's expressive capacity.

Core Inspiration: Non-Equilibrium Thermodynamics

Diffusion models draw inspiration from non-equilibrium statistical thermodynamics. In 2015, Sohl-Dickstein et al. first systematically proposed this idea in the paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics":

The physical intuition is as follows: imagine a drop of ink falling into a glass of clear water. Over time, the ink gradually diffuses until it is uniformly distributed throughout the water (corresponding to the forward noising process). If we could learn to "reverse" this diffusion process, we could recover the initial shape of the ink from its uniform distribution (corresponding to the reverse denoising process).

The key insight is that although the forward diffusion process destroys all structure in the data, each individual step of destruction is small and predictable. If noise is added slowly enough, the reverse of each step is approximately Gaussian and can therefore be parameterized by a neural network.

Advantages of Diffusion Models

Diffusion models surpass their predecessors for several fundamental reasons:

  1. Training stability: No adversarial training is needed (a major pain point of GANs); the loss function is simple mean squared error, and training proceeds smoothly
  2. Mode coverage: By maximizing a (lower bound on) likelihood, diffusion models naturally tend to cover all modes of the data distribution, avoiding mode collapse
  3. Generation quality: With sufficiently many denoising steps, extremely high-quality samples can be generated
  4. Flexibility: No special architectural constraints are needed (unlike Flow models, which require invertibility); any architecture can be used
  5. Theoretical elegance: Deep connections to stochastic processes and score matching provide a solid mathematical foundation

Development Milestones

  • 2015: Sohl-Dickstein et al. propose the original diffusion probabilistic model framework
  • 2019: Song & Ermon propose Score-Based Generative Models (NCSN)
  • 2020: Ho et al. propose DDPM, showing for the first time that diffusion models can match GANs in image generation quality
  • 2021: Nichol & Dhariwal propose Improved DDPM; Song et al. propose DDIM for accelerated sampling
  • 2021: Dhariwal & Nichol publish "Diffusion Models Beat GANs", introducing Classifier Guidance
  • 2022: Rombach et al. propose Latent Diffusion / Stable Diffusion; Google releases Imagen; OpenAI releases DALL-E 2
  • 2023-2024: Diffusion models extend to video (Sora), 3D, audio, protein design, and other domains

DDPM in Detail

DDPM (Denoising Diffusion Probabilistic Models, Ho et al., 2020) is the foundational work on diffusion models. It transformed the 2015 theoretical framework into a practical generative model and, for the first time, matched GANs in image generation quality.

Overall Approach

DDPM defines two Markov chains that are inverses of each other:

Forward Process (Fixed, not learnable):
x_0 ──noising──> x_1 ──noising──> x_2 ──...──> x_{T-1} ──noising──> x_T ~ N(0, I)
(real data)                                                          (pure noise)

Reverse Process (Learned, needs training):
x_T ──denoising──> x_{T-1} ──denoising──> x_{T-2} ──...──> x_1 ──denoising──> x_0
(pure noise)                                                                (generated data)

Key characteristics of the overall architecture:

  • The forward process is predefined and requires no learnable parameters
  • The reverse process is driven by a parameterized neural network, which is the part we need to train
  • Typically \(T = 1000\), meaning the noising/denoising is completed in 1000 steps

Forward Process (Diffusion Process)

The forward process defines how clean data \(x_0\) is gradually transformed into noise. Each step adds a small amount of Gaussian noise to the result of the previous step:

\[ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \, \beta_t \mathbf{I}) \]

where \(\beta_t \in (0, 1)\) is a predefined noise schedule that controls how much noise is added at each step.

Understanding this formula: At step \(t\), the new \(x_t\) is obtained by doing two things to \(x_{t-1}\):

  1. Shrink the signal: Multiply by \(\sqrt{1 - \beta_t}\) (slightly less than 1) to slightly attenuate the original signal
  2. Add noise: Add Gaussian noise with variance \(\beta_t\)

Intuitively, each step makes the data "a bit noisier." After sufficiently many steps, the information in the original data is completely drowned out by noise.

Noise Schedule

The choice of \(\beta_t\) has a significant impact on model performance. Two common scheduling strategies are:

Linear Schedule:

\[ \beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1) \]

The original DDPM paper uses \(\beta_1 = 10^{-4}\), \(\beta_T = 0.02\), \(T = 1000\).

Cosine Schedule:

Nichol & Dhariwal (2021) found that the linear schedule adds noise too aggressively near \(t = T\) and proposed the cosine schedule:

\[ \bar{\alpha}_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2 \]

where \(s = 0.008\) is a small offset to prevent \(\beta_t\) from being too small when \(t\) is close to 0.

Why is the cosine schedule better?

The problem with the linear schedule is that in the latter half of the process, \(\bar{\alpha}_t\) decreases too rapidly, causing data information to be destroyed prematurely. The cosine schedule makes the decline of \(\bar{\alpha}_t\) more gradual and uniform, ensuring that each step carries meaningful signal and that the network receives effective training signal at every timestep.
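To make the comparison concrete, here is a minimal pure-Python sketch (our own illustration, not code from either paper) that builds both schedules and compares the retained signal fraction \(\bar{\alpha}_t\) halfway through the process:

```python
import math

T = 1000

def linear_betas(beta_1=1e-4, beta_T=0.02, T=T):
    """Linear schedule beta_t, using the original DDPM paper's settings."""
    return [beta_1 + (t - 1) / (T - 1) * (beta_T - beta_1) for t in range(1, T + 1)]

def cosine_alpha_bars(T=T, s=0.008):
    """bar_alpha_t under the cosine schedule of Nichol & Dhariwal (2021)."""
    f = lambda t: math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

def alpha_bars_from_betas(betas):
    """bar_alpha_t = prod_{s<=t} (1 - beta_s), the cumulative signal fraction."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

lin_bar = alpha_bars_from_betas(linear_betas())
cos_bar = cosine_alpha_bars()

# Halfway through the process the cosine schedule retains noticeably more signal,
# which is exactly the "more gradual decline" described above.
print(round(lin_bar[499], 3), round(cos_bar[499], 3))
```

Both curves decrease monotonically from near 1 to near 0, but the cosine curve stays higher through the middle of the process.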

Direct Sampling at Any Timestep (Reparameterization Trick)

If training required stepping sequentially from \(x_0\) to \(x_t\), it would be extremely inefficient. Fortunately, by exploiting the additive property of Gaussian distributions, we can jump directly from \(x_0\) to any \(x_t\) in a single step.

Define:

\[ \alpha_t = 1 - \beta_t, \quad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \]

\(\bar{\alpha}_t\) is the cumulative product of \(\alpha\) values. Since each \(\alpha_t\) is slightly less than 1, \(\bar{\alpha}_t\) decreases as \(t\) grows, eventually approaching 0.

Derivation:

From the forward process definition, we have \(x_t = \sqrt{\alpha_t} \, x_{t-1} + \sqrt{\beta_t} \, \epsilon_{t-1}\), where \(\epsilon_{t-1} \sim \mathcal{N}(0, \mathbf{I})\).

Expanding recursively:

\[ \begin{aligned} x_t &= \sqrt{\alpha_t} \, x_{t-1} + \sqrt{1 - \alpha_t} \, \epsilon_{t-1} \\ &= \sqrt{\alpha_t}(\sqrt{\alpha_{t-1}} \, x_{t-2} + \sqrt{1 - \alpha_{t-1}} \, \epsilon_{t-2}) + \sqrt{1 - \alpha_t} \, \epsilon_{t-1} \\ &= \sqrt{\alpha_t \alpha_{t-1}} \, x_{t-2} + \sqrt{\alpha_t(1 - \alpha_{t-1})} \, \epsilon_{t-2} + \sqrt{1 - \alpha_t} \, \epsilon_{t-1} \end{aligned} \]

Since the sum of two independent Gaussian noises is still Gaussian (with variances adding), the last two terms can be merged into a single Gaussian noise with variance \(\alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t\alpha_{t-1}\).

Continuing the recursion all the way to \(x_0\), we ultimately obtain:

\[ q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, \, (1 - \bar{\alpha}_t) \mathbf{I}) \]

Equivalently, expressed using reparameterization:

\[ x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \]

Importance of Reparameterization

This formula is the key to DDPM's training efficiency. During training, we do not need to simulate the forward process step by step. Instead, we simply: (1) randomly sample a timestep \(t\); (2) randomly sample noise \(\epsilon\); (3) directly compute \(x_t\) using the formula above. This allows training to proceed in parallel across arbitrary timesteps.
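The shortcut can be checked numerically. The sketch below (a scalar toy of our own; a real model operates on image tensors) draws \(x_t\) both by simulating the chain step by step and by the one-step formula, and confirms the two agree in distribution:

```python
import math, random

random.seed(0)
T = 1000
betas = [1e-4 + (t - 1) / (T - 1) * (0.02 - 1e-4) for t in range(1, T + 1)]

bar_alpha, p = [], 1.0
for b in betas:
    p *= 1.0 - b
    bar_alpha.append(p)

def q_sample(x0, t):
    """Jump straight from x0 to x_t with a single Gaussian draw."""
    ab = bar_alpha[t - 1]
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * random.gauss(0.0, 1.0)

def q_walk(x0, t):
    """Simulate the forward chain step by step (only to verify the shortcut)."""
    x = x0
    for b in betas[:t]:
        x = math.sqrt(1.0 - b) * x + math.sqrt(b) * random.gauss(0.0, 1.0)
    return x

x0, t, n = 2.0, 400, 4000
jump = [q_sample(x0, t) for _ in range(n)]
walk = [q_walk(x0, t) for _ in range(n)]
mean = lambda xs: sum(xs) / len(xs)
# Both empirical means sit near the theoretical sqrt(bar_alpha_400) * x0.
print(round(mean(jump), 2), round(mean(walk), 2))
```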

Intuition behind \(\bar{\alpha}_t\):

  • When \(t\) is small, \(\bar{\alpha}_t \approx 1\), so \(x_t \approx x_0\) — the data is nearly undisturbed
  • When \(t\) is large, \(\bar{\alpha}_t \approx 0\), so \(x_t \approx \epsilon\) — the data has become pure noise
  • \(\bar{\alpha}_t\) can be understood as the fraction of the original signal's variance retained at time \(t\)

The following ASCII diagram illustrates the change in \(\bar{\alpha}_t\) and image quality during the forward noising process:

bar_alpha_t:  1.0           0.7           0.3           0.05          ~0
              |             |             |             |             |
t:            0            250           500           750          1000
              |             |             |             |             |
Image quality: [clear original] → [slight noise] → [blurry, recognizable] → [almost pure noise] → [pure Gaussian noise]

Reverse Process (Denoising Process)

The goal of the reverse process is to start from \(x_T \sim \mathcal{N}(0, \mathbf{I})\) and progressively denoise, ultimately generating data \(x_0\) that closely follows the true distribution.

The true reverse conditional distribution \(q(x_{t-1} | x_t)\) depends on the entire data distribution \(q(x_0)\) and cannot be computed directly. However, when \(\beta_t\) is sufficiently small, \(q(x_{t-1} | x_t)\) is approximately Gaussian. We therefore approximate it with a parameterized Gaussian distribution:

\[ p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 \mathbf{I}) \]

where \(\mu_\theta(x_t, t)\) is the mean output by a neural network, and \(\sigma_t^2\) is the variance (which can be fixed or learned).

Predicting Noise vs. Predicting the Mean

A key finding by Ho et al. is that rather than having the network directly predict the mean \(\mu_\theta\), it is better to have the network predict the noise \(\epsilon_\theta(x_t, t)\).

This is because, using Bayes' theorem, we can derive the true reverse conditional distribution given \(x_0\):

\[ q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t \mathbf{I}) \]

where:

\[ \tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t \]
\[ \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t \]

Since \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\), we can express \(x_0\) in terms of \(x_t\) and \(\epsilon\):

\[ x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(x_t - \sqrt{1 - \bar{\alpha}_t} \, \epsilon) \]

Substituting this into the formula for \(\tilde{\mu}_t\) and simplifying yields:

\[ \tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon \right) \]

Therefore, if our network \(\epsilon_\theta(x_t, t)\) can accurately predict the noise \(\epsilon\) added to \(x_0\), the mean of the reverse process becomes:

\[ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) \]
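The equivalence of the two expressions for the mean is pure algebra and easy to verify numerically. The sketch below uses a small hypothetical 10-step schedule of our own, purely for the check:

```python
import math, random

random.seed(1)
# A small hypothetical schedule, just to check the algebra numerically.
betas = [0.01 * (i + 1) for i in range(10)]      # beta_1 .. beta_10
alphas = [1.0 - b for b in betas]
bar, p = [], 1.0
for a in alphas:
    p *= a
    bar.append(p)

t = 6                                            # any timestep with t >= 2
a_t, b_t = alphas[t - 1], betas[t - 1]
ab_t, ab_prev = bar[t - 1], bar[t - 2]

x0 = 0.7
eps = random.gauss(0.0, 1.0)
x_t = math.sqrt(ab_t) * x0 + math.sqrt(1.0 - ab_t) * eps

# Posterior mean written in terms of (x0, x_t) ...
mu_from_x0 = (math.sqrt(ab_prev) * b_t / (1.0 - ab_t)) * x0 \
           + (math.sqrt(a_t) * (1.0 - ab_prev) / (1.0 - ab_t)) * x_t
# ... and the simplified form in terms of (x_t, eps): they agree exactly.
mu_from_eps = (x_t - b_t / math.sqrt(1.0 - ab_t) * eps) / math.sqrt(a_t)
print(abs(mu_from_x0 - mu_from_eps) < 1e-9)   # True
```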

Three Equivalent Prediction Targets

The network can be trained to predict three different but equivalent quantities:

  1. Predict noise \(\epsilon_\theta(x_t, t) \approx \epsilon\): The default choice in DDPM, works best in practice
  2. Predict original data \(x_\theta(x_t, t) \approx x_0\): More intuitive in certain applications
  3. Predict score \(s_\theta(x_t, t) \approx \nabla_{x_t} \log q(x_t)\): A bridge to score-based methods

The relationship among the three is: \(\epsilon = -\sqrt{1 - \bar{\alpha}_t} \nabla_{x_t} \log q(x_t | x_0)\)

Variance Selection

The variance \(\sigma_t^2\) in DDPM has two common choices:

  1. \(\sigma_t^2 = \beta_t\): Corresponds to maximizing the entropy of the reverse process
  2. \(\sigma_t^2 = \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\beta_t\): Corresponds to the variance of the true reverse conditional distribution

Ho et al. found that both choices yield similar results. Improved DDPM (Nichol & Dhariwal, 2021) proposes learning an interpolation between the two:

\[ \sigma_t^2 = \exp(v \log \tilde{\beta}_t + (1 - v) \log \beta_t) \]

where \(v\) is a network output that interpolates between the two extremes in log-space.

Architecture Diagram

The following ASCII diagram illustrates the complete forward-reverse process of DDPM:

=========================== Forward process q (fixed) ============================
                  q(x_1|x_0)     q(x_2|x_1)             q(x_T|x_{T-1})
  x_0 ──────────> x_1 ──────────> x_2 ──── ... ────> x_{T-1} ──────────> x_T
 (data)         (+ a little    (+ a little                          (pure noise)
                  noise)         noise)                               ~ N(0, I)
  Signal strength: ████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░
  Noise strength:  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░████████████████████████

========================= Reverse process p_theta (learned) =========================
                  p(x_0|x_1)     p(x_1|x_2)             p(x_{T-1}|x_T)
  x_0 <────────── x_1 <────────── x_2 <──── ... ──── x_{T-1} <────────── x_T
 (generated      (one denoising (one denoising                    (sampled noise)
  data)           step)          step)
                     ^               ^                      ^
                     |               |                      |
              epsilon_theta    epsilon_theta          epsilon_theta
              (neural net)     (neural net)           (neural net)
                     |               |                      |
                  [U-Net]         [U-Net]                [U-Net]
              (shared weights) (shared weights)       (shared weights)

U-Net Architecture

DDPM uses a modified U-Net as the noise prediction network \(\epsilon_\theta(x_t, t)\). The core design elements include:

Timestep Embedding:

The timestep \(t\) is converted into an embedding vector via sinusoidal positional encoding (similar to Transformer) and then injected into each ResBlock:

\[ \text{emb}(t) = \text{MLP}(\text{SinusoidalPE}(t)) \]
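A minimal sketch of such an embedding (the dimension and max_period here are illustrative choices of our own, not values fixed by the paper; in a real model this vector is then passed through an MLP and injected into each ResBlock):

```python
import math

def timestep_embedding(t, dim=16, max_period=10000.0):
    """Transformer-style sinusoidal positional encoding of an integer timestep t."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(250, dim=16)
print(len(emb))   # 16
```

Because each frequency varies at a different rate, nearby timesteps get similar but distinguishable vectors.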

U-Net Structure Diagram:

Input x_t (noisy image)
  |
  v
[Conv] ──────────────────────────────────────────────> [+] → [Conv] → output epsilon
  |                                                    ^
  v                                                    |
[ResBlock + t_emb] ─── skip connection ──────────> [ResBlock + t_emb]
  |                                                    ^
  v                                                    |
[ResBlock + t_emb + Attn] ─── skip ──────────> [ResBlock + t_emb + Attn]
  |                                                    ^
  v                                                    |
[Downsample]                                     [Upsample]
  |                                                    ^
  v                                                    |
[ResBlock + t_emb + Attn] ─── skip ──────────> [ResBlock + t_emb + Attn]
  |                                                    ^
  v                                                    |
[Downsample]                                     [Upsample]
  |                                                    ^
  v                                                    |
  +──────> [ResBlock + Attn] ──> [ResBlock] ──────────>+
                    (Bottleneck)

Key components:

  • ResBlock: Residual blocks that receive the timestep embedding \(t_{\text{emb}}\) as a conditioning signal
  • Self-Attention: Applied on lower-resolution feature maps to capture global dependencies
  • Skip Connection: The hallmark design of U-Net, connecting corresponding encoder and decoder layers
  • Input/Output: The input is the noisy image \(x_t\), and the output is the predicted noise \(\epsilon_\theta\); both have the same spatial dimensions

Training Objective

From the Variational Lower Bound to the Simplified Loss

The training objective of DDPM can be derived from the variational lower bound (ELBO) on the log-likelihood. The full variational lower bound is:

\[ \mathcal{L}_{\text{VLB}} = \mathbb{E}_q \left[ \underbrace{D_{KL}(q(x_T|x_0) \| p(x_T))}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{KL}(q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t))}_{L_{t-1}} \underbrace{- \log p_\theta(x_0|x_1)}_{L_0} \right] \]

where:

  • \(L_T\): KL divergence between the final distribution of the forward process and the prior \(\mathcal{N}(0, \mathbf{I})\); contains no learnable parameters and is a constant
  • \(L_{t-1}\) (\(t = 2, ..., T\)): KL divergence comparing the true reverse conditional distribution with the model's prediction
  • \(L_0\): Reconstruction term measuring the quality of reconstructing \(x_0\) from \(x_1\)

Since both \(q(x_{t-1}|x_t, x_0)\) and \(p_\theta(x_{t-1}|x_t)\) are Gaussian distributions, the KL divergence between them has a closed form and reduces to a weighted squared distance between their means.

One of Ho et al.'s key contributions is the discovery that the following simplified loss function works better in practice:

\[ L_{\text{simple}} = \mathbb{E}_{t \sim U[1,T], \, x_0 \sim q(x_0), \, \epsilon \sim \mathcal{N}(0, \mathbf{I})} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right] \]

where \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\).

Intuitive understanding: The meaning of this loss function is very simple — we know what noise \(\epsilon\) was added, we ask the network to predict that noise \(\epsilon_\theta\), and then we minimize the mean squared error between the two.

Why does the simplified loss work better than the VLB?

In the full VLB, each \(L_{t-1}\) term carries a weighting coefficient that depends on \(t\). The simplified loss drops this weighting (assigning equal weight to all timesteps), which, relative to the VLB, down-weights the terms at small \(t\) (the low-noise steps, where denoising is easy) and focuses training on the harder denoising tasks at larger \(t\). Experiments show that this reweighting improves sample quality, even though the objective is no longer a strict lower bound on the likelihood.

Training Algorithm

The pseudocode for the training process is as follows:

Algorithm: DDPM Training
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Input: Dataset, noise prediction network epsilon_theta, noise schedule {alpha_t, bar_alpha_t}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
repeat:
    1. Sample x_0 ~ q(x_0) from the dataset
    2. Randomly sample timestep t ~ Uniform{1, 2, ..., T}
    3. Randomly sample noise epsilon ~ N(0, I)
    4. Construct noisy sample: x_t = sqrt(bar_alpha_t) * x_0 + sqrt(1 - bar_alpha_t) * epsilon
    5. Compute loss: L = || epsilon - epsilon_theta(x_t, t) ||^2
    6. Gradient descent on theta: theta <- theta - eta * grad_theta(L)
until convergence

Training Efficiency

Note that each training iteration only needs to process a single timestep, rather than the entire chain of \(T\) steps. This is because we can jump directly to any timestep using \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\), without simulating the forward process step by step. This makes training efficiency comparable to standard supervised learning.
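The loop below sketches one such training iteration on a scalar toy in pure Python; epsilon_theta is a hypothetical stand-in for the U-Net (here an untrained predictor that always outputs zero, whose expected loss is \(\mathbb{E}[\epsilon^2] = 1\)):

```python
import math, random

random.seed(2)
T = 1000
betas = [1e-4 + (t - 1) / (T - 1) * (0.02 - 1e-4) for t in range(1, T + 1)]
bar_alpha, p = [], 1.0
for b in betas:
    p *= 1.0 - b
    bar_alpha.append(p)

def epsilon_theta(x_t, t):
    """Stand-in for the U-Net: an untrained predictor that always returns 0.
    Any callable with the signature (x_t, t) -> prediction slots in here."""
    return 0.0

def training_step(x0):
    t = random.randint(1, T)                                  # sample a timestep
    eps = random.gauss(0.0, 1.0)                              # sample noise
    ab = bar_alpha[t - 1]
    x_t = math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps      # jump to x_t directly
    return (eps - epsilon_theta(x_t, t)) ** 2                 # simplified MSE loss

losses = [training_step(0.5) for _ in range(2000)]
print(round(sum(losses) / len(losses), 1))   # about 1.0 for the zero predictor
```

A real implementation would backpropagate this loss through a network; the structure of the iteration is unchanged.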


Sampling Process

After training the model, the process for generating new samples is as follows:

Algorithm: DDPM Sampling
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Input: Trained epsilon_theta, noise schedule {alpha_t, beta_t, bar_alpha_t}
Output: Generated sample x_0
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. Sample initial noise: x_T ~ N(0, I)
2. for t = T, T-1, ..., 1:
       a. if t > 1: sample z ~ N(0, I)
          else:     z = 0
       b. Predict noise with network: epsilon_pred = epsilon_theta(x_t, t)
       c. Denoise one step:
          x_{t-1} = (1/sqrt(alpha_t)) * (x_t - beta_t/sqrt(1 - bar_alpha_t) * epsilon_pred) + sigma_t * z
3. return x_0

where \(\sigma_t = \sqrt{\beta_t}\) or \(\sigma_t = \sqrt{\tilde{\beta}_t}\).

Important notes:

  • The final step (\(t = 1\)) does not add noise (\(z = 0\)); otherwise, the generated image would have residual noise
  • Sampling requires \(T\) forward passes (typically \(T = 1000\)), which is the main bottleneck of DDPM
  • The computational cost of each step equals one U-Net forward pass
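The sampler can be exercised end to end on a 1-D toy where the optimal noise predictor is known in closed form: if the data distribution is \(\mathcal{N}(0, 1)\), then \(\mathbb{E}[\epsilon \mid x_t] = \sqrt{1 - \bar{\alpha}_t} \, x_t\). This is an illustrative setup of our own, not from the paper:

```python
import math, random

random.seed(3)
T = 200
betas = [1e-4 + (t - 1) / (T - 1) * (0.02 - 1e-4) for t in range(1, T + 1)]
bar_alpha, p = [], 1.0
for b in betas:
    p *= 1.0 - b
    bar_alpha.append(p)

def eps_opt(x_t, t):
    """Closed-form optimal noise predictor when the data are N(0, 1):
    E[eps | x_t] = sqrt(1 - bar_alpha_t) * x_t."""
    return math.sqrt(1.0 - bar_alpha[t - 1]) * x_t

def ddpm_sample():
    x = random.gauss(0.0, 1.0)                        # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        a, b = 1.0 - betas[t - 1], betas[t - 1]
        ab = bar_alpha[t - 1]
        z = random.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise on the last step
        x = (x - b / math.sqrt(1.0 - ab) * eps_opt(x, t)) / math.sqrt(a) \
            + math.sqrt(b) * z
    return x

samples = [ddpm_sample() for _ in range(3000)]
m = sum(samples) / len(samples)
v = sum((s - m) ** 2 for s in samples) / len(samples)
print(round(m, 2), round(v, 2))   # near 0 and 1: samples match the data distribution
```

With the exact predictor, the generated samples reproduce the \(\mathcal{N}(0, 1)\) data distribution; a trained network approximates this behavior.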

Sampling Speed Problem

The critical weakness of DDPM is its sampling speed. Generating a single \(256 \times 256\) image requires 1000 U-Net forward passes, which can take anywhere from tens of seconds to several minutes on a GPU. In contrast, a GAN requires only a single forward pass. This has spawned a large body of follow-up work on accelerating sampling.


DDIM: Accelerated Sampling

Song et al. (2021) proposed DDIM (Denoising Diffusion Implicit Models), which can dramatically speed up sampling without retraining the model.

Core Idea

The forward process in DDPM is a Markov chain: \(x_t\) depends only on \(x_{t-1}\). The key insight of DDIM is that as long as the marginal distributions \(q(x_t | x_0)\) remain unchanged, the forward process does not have to be Markovian.

DDIM constructs a family of non-Markovian forward processes such that:

  1. The marginal distribution \(q_\sigma(x_t | x_0)\) at each step is identical to that of DDPM
  2. But the joint distribution \(q_\sigma(x_{1:T} | x_0)\) differs from DDPM
  3. The resulting reverse process allows skipping steps during sampling

DDIM Update Rule

The reverse process update formula for DDIM is:

\[ x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } x_0} + \underbrace{\sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \, \epsilon_\theta(x_t, t)}_{\text{direction pointing to } x_t} + \underbrace{\sigma_t \, \epsilon_t}_{\text{random noise}} \]

where \(\epsilon_t \sim \mathcal{N}(0, \mathbf{I})\), and \(\sigma_t\) is an adjustable parameter:

  • When \(\sigma_t = \sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}} \sqrt{1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}}\), it reduces to DDPM
  • When \(\sigma_t = 0\), we get deterministic sampling, i.e., DDIM

Deterministic vs. Stochastic Sampling

When \(\sigma_t = 0\), the DDIM update becomes entirely deterministic:

\[ x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1}} \, \epsilon_\theta(x_t, t) \]

Advantages of deterministic sampling:

  1. Step skipping: Since there is no randomness, sampling can be performed on a subsequence of timesteps \(\tau_1, \tau_2, ..., \tau_S\) (\(S \ll T\)), e.g., using only 50 steps instead of 1000
  2. Consistent latent encoding: The same \(x_T\) always maps to the same \(x_0\), giving the latent space semantic meaning
  3. Reversibility: Both the forward and reverse processes are deterministic, allowing any image to be encoded into the latent space (similar to GAN inversion)
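These properties can be demonstrated on a 1-D toy with standard-normal data, for which the optimal noise predictor \(\epsilon^*(x_t, t) = \sqrt{1 - \bar{\alpha}_t} \, x_t\) is known in closed form (an illustrative setup of our own, not from the paper):

```python
import math

T = 200
betas = [1e-4 + (t - 1) / (T - 1) * (0.02 - 1e-4) for t in range(1, T + 1)]
bar_alpha, p = [], 1.0
for b in betas:
    p *= 1.0 - b
    bar_alpha.append(p)

def eps_opt(x_t, t):
    """Optimal noise predictor for 1-D standard-normal data."""
    return math.sqrt(1.0 - bar_alpha[t - 1]) * x_t

def ddim_sample(x_T, n_steps=50):
    """Deterministic DDIM update (sigma_t = 0) over a subsequence of timesteps."""
    taus = [round(T * i / n_steps) for i in range(n_steps, 0, -1)]  # T down to T/n_steps
    x = x_T
    for i, t in enumerate(taus):
        ab_t = bar_alpha[t - 1]
        eps = eps_opt(x, t)
        x0_hat = (x - math.sqrt(1.0 - ab_t) * eps) / math.sqrt(ab_t)  # predicted x_0
        if i + 1 < len(taus):
            ab_prev = bar_alpha[taus[i + 1] - 1]
            x = math.sqrt(ab_prev) * x0_hat + math.sqrt(1.0 - ab_prev) * eps
        else:
            x = x0_hat                         # final step: return the predicted x_0
    return x

# No randomness anywhere: the same x_T always maps to the same x_0,
# and 50 steps land very close to the full 200-step trajectory.
print(ddim_sample(1.3) == ddim_sample(1.3))   # True
```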

Acceleration Results

Sampling Steps | DDPM                    | DDIM      | Notes
1000 steps     | FID ~3.17               | FID ~4.04 | Nearly equivalent
100 steps      | Cannot be used directly | FID ~6.84 | 10x speedup
50 steps       | Cannot be used directly | FID ~8.23 | 20x speedup
20 steps       | Cannot be used directly | FID ~16.7 | 50x speedup, quality still acceptable

Subsequent Acceleration Methods

DDIM pioneered the field of accelerated sampling. Many more methods have emerged since:

  • DPM-Solver (Lu et al., 2022): Leverages high-order ODE solvers, achieving high-quality samples in 10-20 steps
  • Consistency Models (Song et al., 2023): Trains the model to directly map to the starting point of the trajectory, enabling one-step generation
  • Progressive Distillation (Salimans & Ho, 2022): Uses knowledge distillation to progressively halve the number of sampling steps
  • Rectified Flow / Flow Matching: Learns straighter transport paths, reducing the number of required steps

The Score-Based Perspective

Diffusion models also have an equally important alternative interpretation: Score-Based Generative Models. This perspective was introduced by Song & Ermon (2019) and later unified within the SDE framework.

Score Function

The score function is defined as the gradient of the log probability density of the data:

\[ s(x) = \nabla_x \log p(x) \]

The score function points in the direction of steepest increase in data density. If we know the score function, we can sample from the distribution using Langevin dynamics:

\[ x_{k+1} = x_k + \frac{\delta}{2} \nabla_x \log p(x_k) + \sqrt{\delta} \, z_k, \quad z_k \sim \mathcal{N}(0, \mathbf{I}) \]

As \(\delta \to 0\) and \(k \to \infty\), \(x_k\) converges to a sample from \(p(x)\).
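A sketch of Langevin dynamics for a target whose score is known exactly (a 1-D standard normal, with score \(-x\)); the step size and step count are illustrative choices of our own:

```python
import math, random

random.seed(0)

def langevin_sample(score, n_steps=100, delta=0.05):
    """Run Langevin dynamics from x = 0 using a known score function."""
    x = 0.0
    for _ in range(n_steps):
        x = x + 0.5 * delta * score(x) + math.sqrt(delta) * random.gauss(0.0, 1.0)
    return x

# For p(x) = N(0, 1) the score is exactly grad log p(x) = -x.
samples = [langevin_sample(lambda x: -x) for _ in range(2000)]
m = sum(samples) / len(samples)
v = sum((s - m) ** 2 for s in samples) / len(samples)
print(round(m, 2), round(v, 2))   # close to 0 and 1 for small delta
```

With a finite step size the stationary distribution is slightly biased (variance \(1/(1 - \delta/4)\) here), which is why the statement above takes \(\delta \to 0\).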

Score Matching

The problem is that we do not know the true \(\nabla_x \log p(x)\). Score Matching (Hyvärinen, 2005) provides a method to estimate the score function without needing to know the normalization constant.

We train a score network \(s_\theta(x) \approx \nabla_x \log p(x)\) by minimizing:

\[ \mathbb{E}_{p(x)} \left[ \| s_\theta(x) - \nabla_x \log p(x) \|^2 \right] \]

This objective can be transformed via integration by parts into a form that does not require knowledge of \(\nabla_x \log p(x)\) (i.e., implicit score matching or denoising score matching).

Denoising Score Matching

Vincent (2011) proved a key result: training the score network on noised data \(\tilde{x} \sim q_\sigma(\tilde{x}|x)\) by minimizing

\[ \mathbb{E}_{q_\sigma(\tilde{x}|x)p(x)} \left[ \| s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x) \|^2 \right] \]

is equivalent (up to a constant) to score matching on the noised distribution. When \(q_\sigma(\tilde{x}|x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)\), the regression target is \(\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x) = -\frac{\tilde{x} - x}{\sigma^2} = -\frac{\epsilon}{\sigma}\).

This establishes the connection between score matching and denoising: estimating the score function is equivalent to predicting the (negated, scaled) noise.

Equivalence with DDPM

There is a precise equivalence between DDPM's noise prediction objective and score matching:

\[ \nabla_{x_t} \log q(x_t | x_0) = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}} \]

Therefore:

\[ \epsilon_\theta(x_t, t) = -\sqrt{1 - \bar{\alpha}_t} \, s_\theta(x_t, t) \]

DDPM's noise prediction network \(\epsilon_\theta\) is essentially estimating the score function of the noised data (up to a scaling factor).
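This scaling relation can be checked directly on the conditional score of \(q(x_t | x_0)\), which is available in closed form (a scalar sketch of our own):

```python
import math, random

random.seed(5)
# q(x_t | x_0) = N(sqrt(ab) * x0, 1 - ab); evaluate its score at a sampled x_t.
ab = 0.6                                    # an arbitrary bar_alpha_t
x0 = 1.5
eps = random.gauss(0.0, 1.0)
x_t = math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps

score = -(x_t - math.sqrt(ab) * x0) / (1.0 - ab)   # grad log of the Gaussian
# epsilon = -sqrt(1 - bar_alpha_t) * score holds exactly, by construction.
print(abs(eps - (-math.sqrt(1.0 - ab) * score)) < 1e-12)   # True
```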

The Unified SDE Framework

Song et al. (2021, "Score-Based Generative Modeling through Stochastic Differential Equations") generalized the discrete diffusion process to a continuous stochastic differential equation (SDE), providing a unified framework.

Forward SDE:

\[ dx = f(x, t) \, dt + g(t) \, dw \]

where \(f\) is the drift coefficient, \(g\) is the diffusion coefficient, and \(w\) is the standard Wiener process.

Reverse SDE (Anderson, 1982):

\[ dx = \left[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t) \, d\bar{w} \]

Key finding: the reverse SDE is entirely determined by the score function \(\nabla_x \log p_t(x)\). Therefore, once the score network is trained, any SDE solver can be used for sampling.

Different SDEs correspond to different models:

SDE Type   | \(f(x, t)\)               | \(g(t)\)                                         | Corresponding Model
VP-SDE     | \(-\frac{1}{2}\beta(t)x\) | \(\sqrt{\beta(t)}\)                              | DDPM
VE-SDE     | \(0\)                     | \(\sqrt{\frac{d[\sigma^2(t)]}{dt}}\)             | NCSN / SMLD
sub-VP SDE | \(-\frac{1}{2}\beta(t)x\) | \(\sqrt{\beta(t)(1-e^{-2\int_0^t \beta(s)ds})}\) | Improved variant

Probability Flow ODE

For every SDE, there exists a corresponding ordinary differential equation (ODE) whose trajectories have the same marginal distributions as those of the SDE:

\[ dx = \left[ f(x, t) - \frac{1}{2} g(t)^2 \nabla_x \log p_t(x) \right] dt \]

This ODE is called the Probability Flow ODE. It is deterministic (no stochastic term), making exact likelihood computation possible and enabling high-order ODE solvers (such as DPM-Solver) to accelerate sampling. DDIM can be viewed as a discretization of the Probability Flow ODE.


Conditional Generation

Unconditional diffusion models learn \(p(x)\), but in practice we typically need conditional generation, i.e., sampling from \(p(x|c)\), where \(c\) is conditioning information (such as a text description, class label, low-resolution image, etc.).

Classifier Guidance

Dhariwal & Nichol (2021) proposed the Classifier Guidance method. The core idea is to use a classifier trained on noisy data to guide the diffusion process.

By Bayes' theorem:

\[ \nabla_x \log p(x | c) = \nabla_x \log p(x) + \nabla_x \log p(c | x) \]

That is, conditional score = unconditional score + classifier gradient. During sampling:

\[ \hat{\epsilon}(x_t, t) = \epsilon_\theta(x_t, t) - \sqrt{1 - \bar{\alpha}_t} \, w \, \nabla_{x_t} \log p_\phi(c | x_t) \]

where \(w\) is the guidance scale:

  • \(w = 0\): Unconditional generation
  • \(w > 0\): Guides toward class \(c\); larger \(w\) produces more "typical" outputs but reduces diversity

Drawbacks:

  • Requires training a separate classifier \(p_\phi(c | x_t)\), which must be trained on noisy data
  • Classifier gradients may introduce adversarial-example-like biases

Classifier-Free Guidance (CFG)

Ho & Salimans (2022) proposed the more elegant Classifier-Free Guidance, which eliminates the need for an external classifier entirely.

Training phase: Train a conditional diffusion model \(\epsilon_\theta(x_t, t, c)\), but during training, randomly replace the condition \(c\) with a null condition \(\emptyset\) with some probability (typically 10%-20%), i.e., dropout on the conditioning information. This way, a single network learns both conditional and unconditional generation simultaneously.

Inference phase: Linearly combine the conditional and unconditional predictions:

\[ \hat{\epsilon}(x_t, t, c) = (1 + w) \, \epsilon_\theta(x_t, t, c) - w \, \epsilon_\theta(x_t, t, \emptyset) \]

Equivalent form (more commonly used):

\[ \hat{\epsilon}(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + (1 + w) \left[ \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset) \right] \]

Intuitive understanding: The unconditional prediction tells us "what a generic image looks like," while the conditional prediction tells us "what an image satisfying condition \(c\) looks like." CFG extrapolates between the two: it not only moves toward the conditional direction but overshoots it, moving further away from the unconditional direction.
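The extrapolation is a one-line function; the sketch below also confirms that the two formulas given above are the same linear combination:

```python
def cfg(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate past the conditional prediction."""
    return eps_uncond + (1.0 + w) * (eps_cond - eps_uncond)

# The two forms in the text are the same linear combination:
ec, eu, w = 0.8, 0.2, 7.5
assert abs(cfg(ec, eu, w) - ((1 + w) * ec - w * eu)) < 1e-12
print(cfg(ec, eu, 0.0))   # w = 0 recovers the plain conditional prediction: 0.8
```

In a real pipeline, eps_cond and eps_uncond are two forward passes of the same network, with and without the condition.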

Choosing the Guidance Scale

The guidance scale \(w\) (or in some literature \(s = 1 + w\)) is crucial for generation quality:

  • \(w = 0\) (\(s = 1\)): Standard conditional generation; high diversity but potentially insufficient alignment with the condition
  • \(w = 1 \sim 3\): A good balance
  • \(w = 7 \sim 15\): Typical setting for Stable Diffusion and similar models; high text-image alignment
  • \(w\) too large: Over-saturated and distorted outputs; diversity drops sharply

There is a diversity-fidelity trade-off: as guidance scale increases, FID first decreases then increases, while CLIP Score (text-image alignment) continues to rise.

Advantages of CFG

Compared to Classifier Guidance, CFG offers the following advantages:

  1. No need to train a separate classifier
  2. Applicable to any type of condition (text, images, audio, etc.), not just discrete class labels
  3. Simpler training — only requires conditioning dropout on the existing model
  4. Better results in practice; it has become the standard method for conditional diffusion models

Latent Diffusion (Stable Diffusion)

Rombach et al. (2022) proposed the Latent Diffusion Model (LDM), whose open-source implementation is the well-known Stable Diffusion. The core idea of LDM is to perform diffusion in latent space rather than pixel space.

Motivation

Performing diffusion directly in pixel space has two problems:

  1. High computational cost: A \(512 \times 512 \times 3\) image has 786,432 dimensions, and the U-Net must perform \(T\) forward passes in this high-dimensional space
  2. Entanglement of semantics and details: Pixel space simultaneously contains high-level semantic information and low-level perceptual details (textures, high-frequency noise), and the diffusion model must learn both

LDM's solution is a two-stage approach:

  1. Stage one: Train an autoencoder (typically a VAE) to compress images into a low-dimensional latent space
  2. Stage two: Train the diffusion model in the low-dimensional latent space

Overall Architecture

                       Latent Diffusion Model Architecture
==========================================================================

Text input: "a cat wearing a spacesuit"
     |
     v
[CLIP Text Encoder]  ──>  text embedding c (77 x 768)
                                |
                                | (Cross-Attention)
                                v
  z_T ~ N(0,I)  ──>  [U-Net (with Cross-Attention)]  ──>  z_0 (latent)
  (latent-space noise)   iteratively denoise for T steps       |
  (64 x 64 x 4)                                                v
                                                         [VAE Decoder]
                                                               |
                                                               v
                                                     generated image x_0
                                                        (512 x 512 x 3)

==========================================================================
Encoding (used only when training stage one):
  input image x  ──>  [VAE Encoder]  ──>  z = E(x)  (64 x 64 x 4)
  Compression: 512x512x3 = 786432 dims  →  64x64x4 = 16384 dims  (~48x)

Stage One: Perceptual Compression

The autoencoder's training objective includes:

  1. Reconstruction loss: \(L_{\text{rec}} = \| x - D(E(x)) \|\)
  2. Perceptual loss: Compares feature-level differences using a pretrained VGG network (LPIPS)
  3. Adversarial loss: A discriminator ensures the realism of reconstructed images
  4. KL regularization (mild): \(L_{\text{KL}} = D_{KL}(q(z|x) \| \mathcal{N}(0, \mathbf{I}))\), with very small weight

Choice of KL Weight

In the LDM paper, the KL regularization weight is approximately \(10^{-6}\), which is very small. This means the latent space distribution is not forced to be standard normal as in a standard VAE, but is instead free to encode information more flexibly. This is important for reconstruction quality. An alternative approach is to use VQ-VAE (vector quantization), replacing KL regularization with a discrete codebook.

Stage Two: Latent Space Diffusion

The diffusion model is trained in the latent space of the pretrained autoencoder. The procedure is identical to standard DDPM, except it operates on latent variables instead of pixels:

\[ L_{\text{LDM}} = \mathbb{E}_{t, z_0 = E(x_0), \epsilon} \left[ \| \epsilon - \epsilon_\theta(z_t, t, c) \|^2 \right] \]

where \(z_t = \sqrt{\bar{\alpha}_t} z_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\), and \(c\) is the conditioning information.
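One training step of this objective can be sketched as follows. A toy 16-dimensional latent and a random linear map standing in for \(\epsilon_\theta\) are illustrative assumptions, not the real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                        # toy latent dimension (real SD: 64*64*4)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product of (1 - beta_t)

W = rng.normal(size=(d, d)) / np.sqrt(d)   # stand-in for eps_theta

def ldm_loss(z0):
    """One Monte Carlo sample of L_LDM for a single latent z0 = E(x0)."""
    t = rng.integers(T)                     # sample a timestep
    eps = rng.normal(size=d)                # sample Gaussian noise
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps
    eps_pred = W @ zt                       # placeholder for eps_theta(z_t, t, c)
    return np.mean((eps - eps_pred) ** 2)   # the MSE objective

z0 = rng.normal(size=d)                     # z0 = E(x0) in the real pipeline
loss = ldm_loss(z0)
assert loss >= 0.0
```

The only change from pixel-space DDPM is that `z0` comes from the frozen VAE encoder rather than being a raw image.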

U-Net + Cross-Attention

The LDM U-Net adds Cross-Attention layers on top of the standard architecture to inject conditioning information:

Structure of each U-Net block:
┌──────────────────────────────────┐
│  ResBlock (+ Timestep Embedding) │
│          |                       │
│  Self-Attention                  │    Q, K, V all from image features
│          |                       │
│  Cross-Attention                 │    Q from image features
│     Q = W_Q * z_feat             │    K, V from text embedding c
│     K = W_K * c                  │
│     V = W_V * c                  │
│          |                       │
│  FFN (Feed-Forward Network)      │
└──────────────────────────────────┘

Cross-Attention computation:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V \]

where \(Q = W_Q \cdot \varphi(z_t)\) (projection of image features), \(K = W_K \cdot \tau_\theta(c)\), \(V = W_V \cdot \tau_\theta(c)\) (projections of the conditioning embeddings).
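This computation, sketched in NumPy with illustrative shapes (256 image tokens of width 320, a 77 × 768 text embedding, head dimension 64 — all dimensions are assumptions for the sketch):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(z_feat, c, W_Q, W_K, W_V):
    """Image features query the text embeddings (single head, no output proj)."""
    Q = z_feat @ W_Q                   # (n_img, d)  queries from image features
    K = c @ W_K                        # (n_txt, d)  keys from text embeddings
    V = c @ W_V                        # (n_txt, d)  values from text embeddings
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))  # (n_img, n_txt) attention weights
    return A @ V                       # each image token mixes text values

rng = np.random.default_rng(0)
z_feat = rng.normal(size=(256, 320))   # flattened image feature tokens
c = rng.normal(size=(77, 768))         # CLIP text embedding (77 x 768)
W_Q = rng.normal(size=(320, 64))
W_K = rng.normal(size=(768, 64))
W_V = rng.normal(size=(768, 64))

out = cross_attention(z_feat, c, W_Q, W_K, W_V)
assert out.shape == (256, 64)
```

Because K and V come from the text embedding, each spatial location of the image feature map can attend to the most relevant tokens of the prompt.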

Text Conditioning: CLIP Text Encoder

Stable Diffusion uses the text encoder from CLIP (Contrastive Language-Image Pre-training) to convert text prompts into embedding vectors:

  1. Text input is tokenized, truncated or padded to 77 tokens
  2. The CLIP Text Encoder (Transformer architecture) encodes the token sequence into a \(77 \times 768\) embedding matrix
  3. This embedding is injected into the U-Net via Cross-Attention
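Step 1 (fixed-length tokenization) can be illustrated with a toy helper. `pad_to_77` is a hypothetical name for this sketch; real pipelines use the CLIP tokenizer's own padding and special tokens:

```python
def pad_to_77(token_ids, pad_id=0, max_len=77):
    """Truncate or pad a token-id list to CLIP's fixed context length of 77."""
    return token_ids[:max_len] + [pad_id] * max(0, max_len - len(token_ids))

short = pad_to_77([101, 202, 303])       # short prompt: padded
long = pad_to_77(list(range(100)))       # long prompt: truncated
assert len(short) == len(long) == 77
assert short[:3] == [101, 202, 303]      # original tokens preserved
```

The fixed length is why every prompt, regardless of its actual length, yields a \(77 \times 768\) embedding matrix.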

Different Versions of Stable Diffusion

  • SD 1.x: Uses the CLIP ViT-L/14 text encoder (768-dimensional)
  • SD 2.x: Uses the OpenCLIP ViT-H/14 text encoder (1024-dimensional)
  • SDXL: Uses both CLIP ViT-L and OpenCLIP ViT-bigG, concatenating both outputs
  • SD 3.x: Introduces the MMDiT architecture (replacing U-Net with DiT), using T5-XXL + CLIP dual encoders

Advantages of Latent Diffusion

| Aspect | Pixel-Space Diffusion | Latent Diffusion |
| --- | --- | --- |
| Spatial dimensions | \(512 \times 512 \times 3 = 786K\) | \(64 \times 64 \times 4 = 16K\) |
| Training cost | Extremely high (thousands of GPU-days) | Significantly reduced (hundreds of GPU-days) |
| Inference speed | Slow | ~4-8x faster |
| Usable on consumer GPUs | Nearly impossible | Yes (8GB VRAM sufficient for inference) |
| Conditioning injection | Difficult to design flexibly | Cross-Attention is flexible and general |

Diffusion Transformer (DiT)

As Transformers have consolidated their dominance across domains, replacing U-Net with Transformer as the backbone of diffusion models has become a natural trend. Peebles & Xie (2023) proposed DiT (Diffusion Transformer).

Core Modifications

DiT divides image latent variables into patches (similar to ViT) and then processes them with a standard Transformer:

latent z_t (32x32x4)
    |
    v
[Patchify: split z into a patch sequence] ──> (256 patches, each 2x2x4 = 16 dims)
    |
    v
[Linear projection + positional encoding] ──> (256 tokens, d dims)
    |
    v
[DiT Block x N]
    |  - Self-Attention
    |  - Cross-Attention (or adaLN-Zero conditioning injection)
    |  - FFN
    |  - conditioning (t, c) injected via adaLN-Zero
    |
    v
[Linear decoder] ──> predicted noise epsilon (32x32x4)
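The patchify step can be sketched in NumPy: a \(32 \times 32 \times 4\) latent with patch size 2 becomes 256 tokens of dimension 16:

```python
import numpy as np

def patchify(z, p=2):
    """Split an (H, W, C) latent into (H/p * W/p, p*p*C) patch tokens."""
    H, W, C = z.shape
    z = z.reshape(H // p, p, W // p, p, C)
    z = z.transpose(0, 2, 1, 3, 4)            # (H/p, W/p, p, p, C)
    return z.reshape((H // p) * (W // p), p * p * C)

z_t = np.random.default_rng(0).normal(size=(32, 32, 4))
tokens = patchify(z_t)
assert tokens.shape == (256, 16)              # 256 patches, 16 dims each
```

Each token is then linearly projected to the Transformer width \(d\) and given a positional encoding, exactly as in ViT.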

adaLN-Zero (Adaptive Layer Norm - Zero): DiT's conditioning injection mechanism. The timestep \(t\) and class label \(c\) are encoded, and an MLP generates the scale and shift parameters for LayerNorm as well as the gate parameters for the residual connection. The gate parameters are initialized to zero, so each block starts out as the identity mapping — this is the "Zero" in the name, and it stabilizes early training.
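A minimal NumPy sketch of this mechanism. The dimensions, the single-sublayer structure, and the linear modulation map are illustrative assumptions:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_zero_block(x, cond_emb, W_mod, sublayer):
    """One residual sublayer with adaLN-Zero conditioning.
    W_mod is zero-initialized, so shift = scale = gate = 0 at init
    and the whole block reduces to the identity mapping."""
    mod = cond_emb @ W_mod                        # (3*d,) modulation params
    d = x.shape[-1]
    shift, scale, gate = mod[:d], mod[d:2*d], mod[2*d:]
    h = layernorm(x) * (1.0 + scale) + shift      # adaptive LayerNorm
    return x + gate * sublayer(h)                 # gated residual connection

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))                       # 4 tokens
cond = rng.normal(size=d)                         # embedding of (t, c)
W_zero = np.zeros((d, 3 * d))                     # zero-initialized modulation
W_s = rng.normal(size=(d, d))
sublayer = lambda h: h @ W_s                      # stand-in for attn / FFN

# At initialization the block is exactly the identity:
out = adaln_zero_block(x, cond, W_zero, sublayer)
assert np.allclose(out, x)
```

Once `W_mod` receives gradient updates, the conditioning vector steers every block through its scale, shift, and gate.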

Significance of DiT

DiT demonstrates that the Transformer architecture is equally effective in diffusion models, with better scaling properties. This laid the groundwork for subsequent large-scale diffusion models (such as Sora and SD3's MMDiT). DiT also validates that diffusion model performance continues to improve with model scale and training compute (scaling law).


Model Comparison

Diffusion vs GAN vs VAE

| Dimension | VAE | GAN | Diffusion |
| --- | --- | --- | --- |
| Generation quality | Moderate; images tend to be blurry | High; sharp images | Highest; rich details |
| Training stability | Stable | Unstable; requires careful tuning | Very stable |
| Mode coverage | Good (maximizes likelihood) | Poor (mode collapse) | Good (maximizes likelihood lower bound) |
| Sampling speed | Fast (single forward pass) | Fast (single forward pass) | Slow (requires multi-step iteration) |
| Likelihood computation | ELBO lower bound | Cannot be computed directly | ELBO lower bound (exact via ODE) |
| Latent space | Meaningful continuous latent space | Weak latent space structure | DDIM provides meaningful latent space |
| Conditional generation | Via conditional encoder | Via conditional discriminator/generator | CFG; very flexible |
| Architecture constraints | Encoder-decoder | Generator + discriminator | Any (U-Net, DiT, ...) |
| Theoretical foundation | Variational inference | Game theory / Wasserstein distance | Stochastic processes / score matching |
| Representative applications | Data compression, representation learning | Image super-resolution, style transfer | Text-to-image, video generation, molecular design |

Comparison of Different Diffusion Model Variants

| Method | Process Type | Sampling Steps | Key Features |
| --- | --- | --- | --- |
| DDPM | Stochastic, Markovian | ~1000 | Original method; simple but slow |
| DDIM | Deterministic/stochastic, non-Markovian | 10-100 | Step-skipping sampling, reversible encoding |
| Score SDE | Continuous SDE | Flexible | Unified framework, theoretically elegant |
| LDM | Latent-space diffusion | 20-50 | Computationally efficient, scalable |
| DiT | Transformer backbone | 20-50 | Good scaling properties |
| Consistency Models | Single/few-step | 1-2 | Distilled or directly trained |

Discussion and Reflections

Why Can Diffusion Models Surpass GANs?

This question can be understood from multiple perspectives:

  1. Fundamental differences in training objectives: GAN training is a minimax game \(\min_G \max_D V(D, G)\), searching for a Nash equilibrium. But in high-dimensional spaces, the Nash equilibrium may not exist or may be unstable. Diffusion training is a simple regression problem (minimizing MSE), with a much smoother loss landscape.

  2. Mode covering vs. mode seeking: A GAN's discriminator can only distinguish "real" from "fake," so the generator tends to find a few modes that can fool the discriminator (mode seeking). Diffusion models maximize a (lower bound on) likelihood, naturally tending to cover all modes (mode covering).

  3. Multi-scale denoising: Diffusion models operate at different noise levels. At high noise levels they learn global structure (composition, color), and at low noise levels they learn local details (textures, edges). This coarse-to-fine generation process naturally decomposes the difficulty.

  4. Evolution of evaluation metrics: Early GAN papers primarily used IS (Inception Score) and FID for evaluation, which have limitations. When more comprehensive evaluations are used (e.g., precision-recall, diversity), the advantages of diffusion models become more apparent.

Addressing the Speed Problem

Sampling speed is the biggest bottleneck of diffusion models. Current solutions follow four main directions:

  1. Better solvers: DDIM, DPM-Solver, etc. leverage ODE/SDE numerical solving theory, using higher-order methods to reduce the number of steps
  2. Distillation: Progressive Distillation distills the model into a version requiring fewer steps; Consistency Distillation trains the model to directly jump to the endpoint of the trajectory
  3. Latent space compression: LDM reduces per-step computation by operating in a lower-dimensional space
  4. Architecture optimization: More efficient attention mechanisms, model pruning, quantization, and other engineering techniques

Relationship Between Diffusion and Flow Matching

Flow Matching (Lipman et al., 2023) is a recently emerging generative modeling method with deep connections to diffusion models:

Similarities:

  • Both transform from a noise distribution to a data distribution
  • Both can be expressed in ODE form
  • Both require training a vector field / score network

Key differences:

  • Diffusion's transport paths are curved (determined by the diffusion process), while Flow Matching can learn straighter paths (optimal transport)
  • Straighter paths mean the ODE is easier to solve, requiring fewer steps for good results
  • Flow Matching's training objective is more concise, directly regressing the velocity field \(v_t(x)\): \(L = \mathbb{E}_{t, x_0, x_1}\|v_\theta(x_t, t) - (x_1 - x_0)\|^2\)
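The objective in the last bullet can be sketched as follows, assuming the simple linear interpolation path \(x_t = (1-t)\,x_0 + t\,x_1\) with \(x_0\) noise and \(x_1\) data, and a toy linear network standing in for \(v_\theta\) (both assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def fm_loss(x1, v_theta):
    """One Monte Carlo sample of the Flow Matching regression loss."""
    x0 = rng.normal(size=d)          # noise endpoint
    t = rng.uniform()                # t ~ U(0, 1)
    xt = (1 - t) * x0 + t * x1       # linear interpolation path
    target = x1 - x0                 # constant velocity along this path
    return np.mean((v_theta(xt, t) - target) ** 2)

W = rng.normal(size=(d + 1, d)) / np.sqrt(d)
v_theta = lambda x, t: np.concatenate([x, [t]]) @ W   # toy velocity net

x1 = rng.normal(size=d)              # a "data" sample
loss = fm_loss(x1, v_theta)
assert loss >= 0.0
```

Note that the regression target is a straight-line velocity: this is exactly the "straighter paths" property that makes the learned ODE cheap to integrate.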

Flow Matching can be viewed as the "next generation" of diffusion models: it retains the training stability advantages of diffusion models while addressing the sampling speed problem through straighter transport paths. State-of-the-art models such as Stable Diffusion 3 and Flux have already adopted the Flow Matching training paradigm.

Frontier Directions for Diffusion Models

  1. Video generation: Sora (OpenAI), Veo (Google), etc. extend diffusion models to the spatiotemporal domain for high-quality video generation
  2. 3D generation: Using diffusion models to generate 3D models, NeRFs, and Gaussian splatting scenes
  3. Scientific applications: Protein structure design (RFdiffusion), molecular generation, weather forecasting (GenCast)
  4. World models: Using diffusion models as world simulators for robot planning and reinforcement learning
  5. Controllable generation: ControlNet, IP-Adapter, and other methods for finer-grained generation control
  6. Efficiency optimization: Consistency Models, SDXL Turbo, and other methods for real-time generation

References

Core papers in chronological order:

  1. Sohl-Dickstein et al., "Deep Unsupervised Learning using Nonequilibrium Thermodynamics", ICML 2015
  2. Song & Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution", NeurIPS 2019
  3. Ho et al., "Denoising Diffusion Probabilistic Models" (DDPM), NeurIPS 2020
  4. Song et al., "Denoising Diffusion Implicit Models" (DDIM), ICLR 2021
  5. Nichol & Dhariwal, "Improved Denoising Diffusion Probabilistic Models", ICML 2021
  6. Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations", ICLR 2021
  7. Dhariwal & Nichol, "Diffusion Models Beat GANs on Image Synthesis", NeurIPS 2021
  8. Ho & Salimans, "Classifier-Free Diffusion Guidance", NeurIPS Workshop 2022
  9. Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (LDM/Stable Diffusion), CVPR 2022
  10. Peebles & Xie, "Scalable Diffusion Models with Transformers" (DiT), ICCV 2023
  11. Song et al., "Consistency Models", ICML 2023
  12. Lipman et al., "Flow Matching for Generative Modeling", ICLR 2023
