Flow Matching
Flow Matching is a family of training methods for generative models based on Continuous Normalizing Flows (CNFs). Introduced by Lipman et al. (2022) in the paper "Flow Matching for Generative Modeling", the core idea is to train a neural network via a remarkably simple regression objective to learn a continuous transformation (velocity field) from a noise distribution to a data distribution. Compared to diffusion models, Flow Matching is mathematically more intuitive, offers higher sampling efficiency, and has become the core training paradigm behind cutting-edge models such as Stable Diffusion 3 and FLUX.
Learning path: Normalizing Flows → Neural ODEs / CNF → Flow Matching (CFM) → Optimal Transport paths → Rectified Flow → Applications (SD3, FLUX)
1. Background and Motivation
The Evolution of Generative Models
Deep generative models have gone through several paradigm shifts:
| Period | Model | Core Idea | Main Limitations |
|---|---|---|---|
| 2013-2014 | VAE | Variational inference + encoder-decoder | Blurry generation quality |
| 2014-2018 | GAN | Adversarial training, generator vs. discriminator | Unstable training, mode collapse |
| 2020-2022 | Diffusion | Gradual noising → gradual denoising | Many sampling steps, complex math |
| 2022-present | Flow Matching | Directly learning velocity fields, ODE framework | Few so far (simpler theory, faster sampling) |
Problems with Diffusion Models
Despite the enormous success of diffusion models in image generation, they suffer from several pain points:
- Slow sampling: Standard DDPM requires 1000 sampling steps; even accelerated methods like DDIM typically need 20-50 steps
- Complex mathematical framework: Involves forward SDEs, reverse SDEs, score functions, noise schedules (\(\alpha_t, \bar{\alpha}_t, \sigma_t\)), and numerous other terms
- Curved paths: The diffusion process follows a "curved path" from data to noise, requiring more steps during reverse sampling to track accurately
- Many design choices: The choice of noise schedule (linear, cosine, sigmoid, etc.) significantly affects generation quality, and there is no unified theoretical guidance
The Core Idea of Flow Matching
Flow Matching answers a simple yet profound question:
Can we find the shortest, straightest path from noise to data, and then directly learn the velocity along that path?
The answer is yes. Flow Matching works as follows:
- Define a linear interpolation path from noise \(x_0\) to data \(x_1\)
- Compute the velocity (direction + magnitude) along this path
- Train a neural network to predict this velocity
The entire training objective reduces to a simple mean squared error regression problem — no noise schedule, no score matching, no SDE solver needed.
2. Prerequisite: Normalizing Flows
Before understanding Flow Matching, two prerequisite concepts are needed: traditional Normalizing Flows and Continuous Normalizing Flows (Neural ODEs).
Invertible Transformations and Density Transformation
The core idea of Normalizing Flows is to transform a simple distribution (e.g., a standard Gaussian) into a complex data distribution through a series of invertible transformations.
Let \(z \sim p_z(z)\) be a simple prior distribution (e.g., \(\mathcal{N}(0, I)\)), and \(f\) be an invertible transformation. Data \(x\) is generated as:
\[ x = f(z) \]
Change of Variables Formula
Since \(f\) is invertible, we can compute the exact probability density of the data via the Change of Variables formula:
\[ p_x(x) = p_z\left( f^{-1}(x) \right) \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right| \]
Taking the logarithm:
\[ \log p_x(x) = \log p_z\left( f^{-1}(x) \right) + \log \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right| \]
This means that as long as we can compute \(f^{-1}\) and the determinant of its Jacobian matrix, we can exactly compute the log-likelihood of the data and perform maximum likelihood training.
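The formula can be checked concretely in one dimension. The sketch below uses a hypothetical affine map \(f(z) = az + b\) (the values of `a` and `b` are illustrative, not from the text); since \(x = f(z)\) with \(z \sim \mathcal{N}(0,1)\) is exactly \(\mathcal{N}(b, a^2)\), the change-of-variables density can be compared against the known answer.

```python
import math
import torch

# Illustrative invertible map f(z) = a*z + b (a, b chosen arbitrarily)
a, b = 2.0, 1.0

def log_prob_x(x):
    z = (x - b) / a                                     # f^{-1}(x)
    log_pz = -0.5 * z**2 - 0.5 * math.log(2 * math.pi)  # log N(z; 0, 1)
    log_det = -math.log(abs(a))                         # log|det df^{-1}/dx|
    return log_pz + log_det

# x = f(z) is distributed as N(b, a^2); compare against the analytic density
lp = log_prob_x(torch.tensor(1.5))
analytic = torch.distributions.Normal(b, a).log_prob(torch.tensor(1.5))
```

The two log-densities agree, which is exactly the maximum-likelihood training signal a Normalizing Flow uses.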
Limitations of Traditional Normalizing Flows
To make the above formula tractable, traditional Normalizing Flows (e.g., RealNVP, Glow, NICE) must satisfy two constraints:
- The transformation must be invertible: Both \(f\) and \(f^{-1}\) must exist and be computable
- The Jacobian determinant must be tractable: Computing the determinant of a general square matrix is \(O(n^3)\), so special architectural designs are required
Architectural constraints
To simultaneously satisfy invertibility and tractable determinant computation, traditional Normalizing Flows can only use special structures (e.g., affine coupling layers, autoregressive transforms), which severely limits the model's expressive power. In contrast, diffusion models and Flow Matching can use arbitrary neural network architectures (e.g., U-Net, DiT).
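To make the constraint concrete, here is a minimal affine coupling layer in the spirit of RealNVP (a sketch, not the exact published architecture; the single `nn.Linear` conditioner is a simplification). Half the input passes through unchanged and parameterizes an affine transform of the other half, so both the inverse and the log-determinant are cheap:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # outputs log-scale s and shift t, each of size dim // 2
        self.net = nn.Linear(dim // 2, dim)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(s) + t         # invertible in x2 given x1
        log_det = s.sum(dim=-1)            # Jacobian is block-triangular
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        s, t = self.net(y1).chunk(2, dim=-1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=-1)

layer = AffineCoupling(dim=8)
x = torch.randn(4, 8)
y, log_det = layer(x)
x_rec = layer.inverse(y)
```

The design choice is visible in the code: invertibility and a tractable determinant come for free, but only because half the dimensions are frozen per layer, which is precisely the expressiveness restriction the text describes.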
3. Prerequisite: Continuous Normalizing Flows (Neural ODEs)
From Discrete to Continuous
Traditional Normalizing Flows achieve distribution transformation through compositions of a finite number of invertible transformations. A natural question is: what happens if the number of steps goes to infinity and each step's change becomes infinitesimal?
The answer is: the discrete sequence of transformations becomes an Ordinary Differential Equation (ODE).
Definition of Neural ODEs
Continuous Normalizing Flows (CNFs), also known as Neural ODEs (Chen et al., 2018), define a continuous transformation of data using a vector field (velocity field) \(v_\theta(x, t)\):
\[ \frac{dx(t)}{dt} = v_\theta(x(t), t), \quad t \in [0, 1] \]
where:
- \(x(0) = x_0 \sim p_0\): the starting distribution (typically standard Gaussian noise)
- \(x(1) = x_1 \sim p_1\): the target distribution (data distribution)
- \(v_\theta(x, t)\): the neural network-parameterized velocity field, indicating in which direction and at what speed a particle at position \(x\) and time \(t\) should move
Intuitive understanding: imagine countless particles in space, starting from positions drawn from the noise distribution. The velocity field \(v_\theta\) acts like a "wind field," telling each particle where to go at every position and time. As all particles flow from \(t=0\) to \(t=1\), they transform from the noise distribution to the data distribution.
t=0 (noise)                                t=1 (data)
 .  .                                        .***.
. .. .     v_θ(x,t) guides the particles    *     *
 . .    ──────────────────────────→        *  *  *
. . .           ODE integration             *     *
 .  .                                        .***.
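The picture above can be made concrete with a toy integrator. Instead of a trained network, the sketch below uses the known linear field \(v(x,t) = -x\), whose ODE has the closed-form solution \(x(t) = x(0)\,e^{-t}\), so the Euler result can be checked:

```python
import math
import torch

def euler_integrate(v, x0, num_steps=1000):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = x0.clone(), 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * v(x, k / num_steps)   # one Euler step along the flow
    return x

# Toy velocity field v(x, t) = -x: contracts every particle toward the origin
x0 = torch.tensor([2.0, -1.0])
x1 = euler_integrate(lambda x, t: -x, x0)
expected = x0 * math.exp(-1.0)             # closed-form x(1)
```

In a real CNF the lambda would be the neural network \(v_\theta\); everything else about "flowing particles from \(t=0\) to \(t=1\)" is the same.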
The Continuity Equation
In CNFs, the evolution of the probability density \(p_t(x)\) over time satisfies the Continuity Equation:
\[ \frac{\partial p_t(x)}{\partial t} + \nabla \cdot \left( p_t(x) \, v_t(x) \right) = 0 \]
This reflects conservation of mass from fluid mechanics: the "total amount" of probability is preserved during the transformation (always integrating to 1), and changes in probability density are entirely determined by the divergence of the velocity field.
Expanding the divergence term:
\[ \frac{\partial p_t(x)}{\partial t} = -\nabla p_t(x) \cdot v_t(x) - p_t(x) \, \nabla \cdot v_t(x) \]
Physical intuition behind the continuity equation
Imagine probability density as a "fluid." \(v_t(x)\) is the flow velocity, and \(\nabla \cdot v_t(x)\) is the divergence. If the velocity field in some region is "divergent" (divergence > 0), fluid flows out of that region and the density decreases; if it is "convergent" (divergence < 0), fluid flows into the region and the density increases.
Evolution of Log-Density
For CNFs, the change in log-density along a flow line can be expressed via the instantaneous change of variables formula:
\[ \frac{d \log p(x(t))}{dt} = -\text{tr}\left( \frac{\partial v_\theta}{\partial x}(x(t), t) \right) \]
Here \(\text{tr}\left(\frac{\partial v_\theta}{\partial x}\right)\) is the trace of the velocity field's Jacobian, which replaces the full Jacobian determinant computation required in traditional Normalizing Flows.
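Even the trace can be estimated without materializing the Jacobian, via Hutchinson's estimator \(\text{tr}(A) = \mathbb{E}_\varepsilon[\varepsilon^\top A \varepsilon]\) with \(\varepsilon \sim \mathcal{N}(0, I)\) — the trick used by FFJORD-style CNF training. A sketch with a toy linear velocity field \(x \mapsto Ax\) whose trace is known:

```python
import torch

def hutchinson_trace(v_fn, x, num_samples=2000):
    """Monte Carlo estimate of tr(dv/dx) at a point x."""
    x = x.clone().requires_grad_(True)
    out = v_fn(x)
    est = 0.0
    for _ in range(num_samples):
        eps = torch.randn_like(x)
        # vector-Jacobian product J^T @ eps via autograd (no full Jacobian)
        (g,) = torch.autograd.grad(out, x, eps, retain_graph=True)
        est += (g * eps).sum()
    return est / num_samples

torch.manual_seed(0)
A = torch.tensor([[1.0, 0.3], [0.0, 2.0]])   # true trace = 3
trace_est = hutchinson_trace(lambda x: A @ x, torch.randn(2))
```

The estimate is unbiased but noisy; in practice one noise sample per training step is typical, amortizing the variance over the batch.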
Training Difficulties of CNFs
Although CNFs are theoretically elegant, direct training suffers from severe computational issues:
- Forward pass requires ODE solving: To obtain \(x(1)\), one must numerically integrate the entire ODE from \(x(0)\)
- Backpropagation requires passing gradients through the ODE solver: Using the adjoint method to compute gradients requires solving another ODE
- Extremely high training cost: Each training step involves multiple ODE solves, and step sizes need adaptive adjustment
These issues make traditional CNFs practically infeasible on large-scale datasets. Flow Matching was proposed precisely to address this training difficulty.
4. Core Method of Flow Matching
Objective: Learning the Marginal Velocity Field
Our ultimate goal is to learn a velocity field \(v_\theta(x, t)\) such that the ODE it defines transforms the noise distribution \(p_0\) into the data distribution \(p_1\).
Ideally, we want to minimize the Flow Matching (FM) objective:
\[ \mathcal{L}_{FM}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1], \; x \sim p_t(x)} \left\| v_\theta(x, t) - u_t(x) \right\|^2 \]
where \(u_t(x)\) is the true velocity field that generates the probability path \(p_t(x)\).
Problem: Both \(u_t(x)\) and \(p_t(x)\) are unknown. We do not know what the "true" velocity field that transforms noise into data looks like.
Key Insight: Conditional Flow Matching
The core contribution of Lipman et al. is proving that we do not need to know the marginal velocity field \(u_t(x)\); it suffices to regress against the conditional velocity field \(u_t(x | x_1)\).
Define the conditional probability path \(p_t(x | x_1)\): given a data point \(x_1\), this is the probability path from noise to that specific data point.
Define the conditional velocity field \(u_t(x | x_1)\): the velocity field that generates the conditional probability path \(p_t(x | x_1)\).
Conditional Flow Matching (CFM) objective:
\[ \mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1], \; x_1 \sim q(x_1), \; x \sim p_t(x | x_1)} \left\| v_\theta(x, t) - u_t(x | x_1) \right\|^2 \]
Equivalence of CFM and FM
Lipman et al. proved a key theorem: \(\mathcal{L}_{CFM}\) and \(\mathcal{L}_{FM}\) are equivalent in terms of gradients, i.e., \(\nabla_\theta \mathcal{L}_{CFM} = \nabla_\theta \mathcal{L}_{FM}\). This means the velocity field obtained by minimizing the CFM objective is identical to what we would get from directly minimizing the FM objective. This result is liberating: we only need to design simple conditional paths and conditional velocity fields.
Optimal Transport Path: The Simplest Choice
There are infinitely many choices for the conditional probability path and conditional velocity field. The simplest and most natural choice is the Optimal Transport (OT) path, also known as the linear interpolation path.
Conditional probability path (Gaussian form):
\[ p_t(x | x_1) = \mathcal{N}\left( x; \; t x_1, \; \left( 1 - (1 - \sigma_{\min}) t \right)^2 I \right) \]
where \(\sigma_{\min}\) is a small positive number (close to 0) controlling the variance at the endpoint.
When \(\sigma_{\min} \to 0\), this simplifies to:
\[ p_t(x | x_1) = \mathcal{N}\left( x; \; t x_1, \; (1-t)^2 I \right) \]
- \(t = 0\): \(p_0(x | x_1) = \mathcal{N}(x; 0, I)\), i.e., standard Gaussian noise
- \(t = 1\): \(p_1(x | x_1) = \mathcal{N}(x; x_1, \sigma_{\min}^2 I) \approx \delta(x - x_1)\), nearly concentrated at the data point \(x_1\)
Conditional velocity field (OT path):
\[ u_t(x | x_1) = \frac{x_1 - (1 - \sigma_{\min}) x}{1 - (1 - \sigma_{\min}) t} \]
The Minimal Form: Linear Interpolation
When \(\sigma_{\min} \to 0\), everything becomes extremely simple.
Interpolation formula: Given noise \(x_0 \sim \mathcal{N}(0, I)\) and data \(x_1 \sim q(x_1)\), the interpolation at time \(t\) is:
\[ x_t = (1 - t) x_0 + t x_1 \]
t=0        t=0.25       t=0.5        t=0.75       t=1
noise x₀ ─────→──────→──────→──────→ data x₁
x_t = (1-t)x₀ + tx₁  (linear interpolation)
Velocity: Differentiating \(x_t\) with respect to \(t\), the velocity is a constant:
\[ u_t = \frac{d x_t}{dt} = x_1 - x_0 \]
This is the target value of the conditional velocity field. The velocity is simply the endpoint minus the starting point — moving in a straight line at constant speed.
Training objective:
\[ \mathcal{L}(\theta) = \mathbb{E}_{t, \, x_0, \, x_1} \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2 \]
Connection to noise prediction in diffusion models
In diffusion models (DDPM), the training objective is to predict the added noise \(\epsilon\): \(\| \epsilon_\theta(x_t, t) - \epsilon \|^2\). In Flow Matching, the training objective is to predict the velocity \(v = x_1 - x_0\). Both are simple regression objectives, but Flow Matching's target is more intuitive: the velocity is "data minus noise," pointing from noise toward data.
5. Training and Sampling Algorithms
Training Algorithm
The training procedure for Flow Matching is extremely simple:
Algorithm: Flow Matching Training
──────────────────────────────────
Input: Dataset D, network v_θ
Repeat:
    1. Sample data x₁ ~ q(x₁)
    2. Sample standard Gaussian noise x₀ ~ N(0, I)
    3. Sample time uniformly t ~ U[0, 1]
    4. Compute interpolation: x_t = (1-t) x₀ + t x₁
    5. Compute target velocity: u = x₁ - x₀
    6. Compute loss: L = || v_θ(x_t, t) - u ||²
    7. Backpropagate and update θ
Until convergence
Note that the training process requires no ODE solver at all. Compared to traditional CNFs, this is a qualitative leap:
- Traditional CNF: Each training step requires solving an ODE (forward + backward), incurring very high computational cost
- Flow Matching: Each training step requires only one forward pass + one backward pass, no different from training an ordinary regression network
Sampling Algorithm
After training, sampling requires integrating the ODE \(\frac{dx}{dt} = v_\theta(x, t)\) from \(t=0\) to \(t=1\).
Euler method sampling (simplest):
Algorithm: Flow Matching Sampling (Euler Method)
──────────────────────────────────
Input: Trained v_θ, number of steps N
1. Sample x₀ ~ N(0, I)
2. Set step size Δt = 1/N
3. for k = 0, 1, ..., N-1:
       t_k = k / N
       x_{k+1} = x_k + Δt · v_θ(x_k, t_k)
4. Return x_N as the generated sample
Higher-order ODE solvers (more accurate):
Higher-order solvers such as RK45 (fourth-fifth order Runge-Kutta) or the midpoint method can also be used. Higher-order methods are more accurate at the same number of steps, but each step requires multiple network function evaluations (NFEs).
Choosing the number of steps
Since Flow Matching typically learns relatively straight paths (especially with OT paths), the Euler method can achieve good results with relatively few steps (e.g., 20-50). This is much fewer than the typical sampling steps for diffusion models. With further optimization via Rectified Flow, the number can be reduced to as few as 1-4 steps.
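As a sketch of a second-order alternative to Euler, here is the midpoint method. The callable `v` stands in for the trained velocity field \(v_\theta\); a toy field with a known solution (\(v(x,t) = -x\), so \(x(1) = x_0 e^{-1}\)) is used so the result can be checked:

```python
import math
import torch

def sample_midpoint(v, x0, num_steps=25):
    """Midpoint (2nd-order) ODE sampling: 2 NFEs per step, O(dt^2) accuracy."""
    x, dt = x0.clone(), 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x_mid = x + 0.5 * dt * v(x, t)        # half Euler step
        x = x + dt * v(x_mid, t + 0.5 * dt)   # full step with midpoint slope
    return x

# Toy field v(x, t) = -x  =>  x(1) = x0 * exp(-1)
x0 = torch.tensor([1.0, -2.0])
x1 = sample_midpoint(lambda x, t: -x, x0)
expected = x0 * math.exp(-1.0)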
6. Detailed Comparison: Flow Matching vs. Diffusion
Framework Comparison
| Aspect | Diffusion (DDPM/Score-based) | Flow Matching |
|---|---|---|
| Stochastic process | SDE (Stochastic Differential Equation) | ODE (Ordinary Differential Equation) |
| Training objective | Predict noise \(\epsilon\) or score \(\nabla_x \log p_t(x)\) | Predict velocity \(v = x_1 - x_0\) |
| Path shape | Curved path (determined by noise schedule) | Straight path (OT path) |
| Sampling process | Reverse SDE/ODE solving | Forward ODE solving |
| Design choices | Noise schedule, \(\beta_t\), parameterization, etc. | Almost no hyperparameters |
| Mathematical complexity | High (SDE, Fokker-Planck, score matching) | Low (ODE, linear interpolation, MSE regression) |
| Sampling steps | Typically 20-1000 steps | Typically 10-50 steps |
Intuition Behind Path Comparison
Diffusion path (curved): Flow Matching path (straight):
data x₁ . data x₁ .
/ \ |
/ \ |
| | curved path | straight path
| | needs more steps | fewer steps
\ / |
\ / |
noise x₀ . noise x₀ .
Curved paths lead to larger discretization errors and require more sampling steps to maintain accuracy. Straight paths minimize discretization error — even coarse Euler steps can yield good results.
Mathematical Equivalence
It is worth noting that Diffusion and Flow Matching are mathematically equivalent under certain conditions. Specifically:
- DDPM's deterministic sampler (DDIM) is actually solving a probability flow ODE
- If we view Diffusion's noise schedule as a particular choice of probability path, it can be subsumed under the general Flow Matching framework
Therefore, Flow Matching can be seen as a more general framework than Diffusion, rather than merely an alternative. Diffusion's SDE path is one specific case among the infinitely many path choices in Flow Matching — but not the optimal one.
Why Straight Paths Are Better
The advantages of OT straight-line paths can be understood from several perspectives:
- Minimal discretization error: Straight paths have zero curvature; Euler integration on a straight line is exact (if the velocity field is perfectly learned)
- Easier to learn: The velocity along a straight path is the constant \(x_1 - x_0\), which does not change over time. The network only needs to learn "how to infer the target direction given the current position and time"
- Optimal transport interpretation: Straight paths correspond to the optimal transport plan under the quadratic Wasserstein distance — the most "economical" way to transform one distribution into another
7. Deeper Understanding: Why CFM Is Equivalent to FM
This section provides a more detailed explanation of the CFM equivalence theorem.
Construction of the Marginal Velocity Field
Given the conditional velocity field \(u_t(x | x_1)\) and the conditional probability path \(p_t(x | x_1)\), the marginal velocity field can be constructed by taking an expectation over all data points:
\[ u_t(x) = \int u_t(x | x_1) \, \frac{p_t(x | x_1) \, q(x_1)}{p_t(x)} \, dx_1 \]
where the marginal probability density is:
\[ p_t(x) = \int p_t(x | x_1) \, q(x_1) \, dx_1 \]
Intuitively, the marginal velocity field is a probability-weighted average of all conditional velocity fields. At position \(x\) and time \(t\), the contribution of each data point \(x_1\) to the velocity field is weighted by \(p_t(x | x_1) q(x_1)\).
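This weighted average can be computed by hand for a toy 1-D dataset with two equally likely points \(x_1 \in \{-1, +1\}\), using the OT conditional path \(p_t(x|x_1) = \mathcal{N}(t x_1, (1-t)^2)\) and velocity \(u_t(x|x_1) = (x_1 - x)/(1-t)\):

```python
import torch

def marginal_velocity(x, t, data=(-1.0, 1.0)):
    """p_t(x|x1)-weighted average of conditional velocities, uniform q(x1)."""
    num, den = 0.0, 0.0
    for x1 in data:
        w = torch.exp(torch.distributions.Normal(t * x1, 1.0 - t).log_prob(x))
        num = num + w * (x1 - x) / (1.0 - t)   # weighted conditional velocity
        den = den + w
    return num / den

# At x = 0, t = 0.5 the conditional velocities are +2 and -2 with equal
# weights, so the marginal velocity cancels to zero by symmetry.
u_mid = marginal_velocity(torch.tensor(0.0), t=0.5)
```

This is exactly the averaging behavior described in the next section on Rectified Flow: where conditional paths pull in opposite directions, the marginal velocity is their compromise.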
Proof Sketch of Gradient Equivalence
The key steps in proving \(\nabla_\theta \mathcal{L}_{CFM} = \nabla_\theta \mathcal{L}_{FM}\) are as follows:
- Expand \(\mathcal{L}_{FM}\) as \(\| v_\theta - u_t \|^2\), where \(u_t\) is the marginal velocity field
- Note that the cross-term relevant to \(\theta\) in the expansion is \(-2 \, \mathbb{E}[v_\theta(x,t) \cdot u_t(x)]\)
- Substitute the definition of the marginal velocity field and swap the order of integration using \(p_t(x) = \int p_t(x|x_1)q(x_1)dx_1\)
- Arrive at the same gradient expression as \(\mathcal{L}_{CFM}\)
Practical significance
This equivalence means that during training, we only need to construct simple conditional paths and conditional velocities (linear interpolation + constant velocity) for each data point \(x_1\), and the network will naturally learn the correct marginal velocity field. The conditional paths of different data points may cross (since straight lines from different noise samples to different data points can intersect), but after gradient averaging, the marginal velocity field learned by the network will correctly handle these crossings.
8. Rectified Flow: Further Straightening the Paths
Motivation
Although Flow Matching uses straight conditional paths, the marginal velocity field is a weighted average of all conditional velocity fields, so the actual marginal flow lines may still be curved.
Consider a simple example: if the data distribution has two modes (e.g., two Gaussians), the straight paths from noise to different modes will cross. In the crossing region, the velocity field is an average of two different directions, causing the actual flow lines to curve around.
The Reflow Operation
Rectified Flow (Liu et al., 2022) proposes an iterative method for straightening flow lines, called Reflow:
- Step 1: Train a velocity field \(v_\theta^{(1)}\) using standard Flow Matching
- Step 2: Generate paired data using \(v_\theta^{(1)}\). Starting from \(x_0 \sim p_0\), integrate the ODE to obtain \(\hat{x}_1 = \text{ODESolve}(x_0, v_\theta^{(1)})\)
- Step 3: Retrain Flow Matching with the new pairs \((x_0, \hat{x}_1)\) to obtain \(v_\theta^{(2)}\)
- Repeat: Continue iterating the reflow process
Round 1 training: random pairing (x₀, x₁)
Paths cross, marginal flow lines are curved
× ← path crossing
/ \
x₀ₐ ────/──────→ x₁ₐ
x₀ᵦ ──/────────→ x₁ᵦ
After reflow: pairs become (x₀, ODESolve(x₀))
Fewer crossings, straighter flow lines
x₀ₐ ──────────→ x₁ₐ
x₀ᵦ ──────────→ x₁ᵦ
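The pair-generation half of one Reflow round can be sketched as follows. Here `model` stands for the trained velocity field \(v_\theta^{(1)}\) (any callable `model(x, t)`); the returned \((x_0, \hat{x}_1)\) pairs would then be fed back into ordinary Flow Matching training, which is not shown:

```python
import torch

@torch.no_grad()
def make_reflow_pairs(model, num_pairs, dim, num_steps=100):
    """Generate (x0, x1_hat) pairs with x1_hat = ODESolve(x0, model)."""
    x0 = torch.randn(num_pairs, dim)
    x, dt = x0.clone(), 1.0 / num_steps
    for k in range(num_steps):               # fixed-step Euler ODE solve
        t = torch.full((num_pairs, 1), k * dt)
        x = x + dt * model(x, t)
    return x0, x

# Demo with a toy "model" whose ODE solution is known: v(x, t) = -x,
# so x1_hat ≈ x0 * exp(-1)
x0, x1_hat = make_reflow_pairs(lambda x, t: -x, num_pairs=4, dim=2)
```

The key property is that each \(\hat{x}_1\) is now deterministically coupled to its own \(x_0\), so retraining on these pairs cannot produce crossing straight-line paths between independently sampled endpoints.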
The Ultimate Goal: Straight-Line Paths
After reflow iterations, the flow lines become increasingly straight. When the flow lines are perfectly straight, a single Euler step (with \(\Delta t = 1\)) yields exact sampling:
\[ x_1 = x_0 + v_\theta(x_0, 0) \]
This means sampling degenerates to a single forward pass, comparable to the sampling speed of GANs.
Reflow and distillation
Reflow can be combined with knowledge distillation. First straighten the paths via reflow, then compress multi-step sampling to 1-2 steps through progressive distillation. InstaFlow (2023) is a successful example of this approach, achieving high-quality single-step image generation.
9. Connection to Score Matching
Flow Matching and Score-based Diffusion Models (score matching) share a close mathematical relationship.
Relationship Between Score and Velocity
In the Diffusion framework, the score function is defined as:
\[ s(x, t) = \nabla_x \log p_t(x) \]
In the Flow Matching framework, for a Gaussian conditional path \(p_t(x | x_1) = \mathcal{N}(x; \mu_t(x_1), \sigma_t^2 I)\), the conditional velocity field and the conditional score are related by:
\[ u_t(x | x_1) = \frac{\dot{\sigma}_t}{\sigma_t} \left( x - \mu_t(x_1) \right) + \dot{\mu}_t(x_1), \qquad \nabla_x \log p_t(x | x_1) = -\frac{x - \mu_t(x_1)}{\sigma_t^2} \]
so that \(u_t(x | x_1) = -\sigma_t \dot{\sigma}_t \, \nabla_x \log p_t(x | x_1) + \dot{\mu}_t(x_1)\), where \(\mu_t\) and \(\sigma_t\) are the time-dependent mean and standard deviation of the conditional distribution, respectively.
For the OT path (\(\mu_t = t x_1\), \(\sigma_t = 1-t\)):
\[ u_t(x | x_1) = \frac{-1}{1-t} \left( x - t x_1 \right) + x_1 = \frac{x_1 - x}{1 - t} \]
Unifying Prediction Targets
Different parameterizations actually predict different but equivalent quantities:
| Parameterization | Prediction target | Relationship |
|---|---|---|
| \(\epsilon\)-prediction | Noise \(\epsilon = x_0\) | DDPM standard |
| \(v\)-prediction | Velocity \(v = x_1 - x_0\) | Flow Matching standard |
| \(x\)-prediction | Data \(x_1\) | Directly predicting the denoised result |
| score prediction | \(\nabla_x \log p_t\) | Score-based standard |
Under the OT path \(x_t = (1-t) x_0 + t x_1\), these prediction targets can be converted to one another. With \(v = x_1 - x_0\):
\[ \epsilon = x_0 = x_t - t\,v, \qquad x_1 = x_t + (1-t)\,v, \qquad \nabla_x \log p_t(x_t | x_1) = -\frac{x_t - t x_1}{(1-t)^2} = -\frac{\epsilon}{1-t} \]
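The conversions are simple enough to verify numerically. The sketch below builds an OT-path sample and recovers \(\epsilon\), \(x_1\), and the conditional score from \((x_t, t, v)\) alone:

```python
import torch

# Build one OT-path sample: x_t = (1-t)*x0 + t*x1, velocity v = x1 - x0
torch.manual_seed(0)
x0, x1 = torch.randn(4), torch.randn(4)
t = 0.3
xt = (1 - t) * x0 + t * x1
v = x1 - x0

eps_from_v = xt - t * v                 # recovers the noise epsilon = x0
x1_from_v = xt + (1 - t) * v            # recovers the data x1
score_from_v = -eps_from_v / (1 - t)    # conditional score -(x_t - t*x1)/(1-t)^2
```

Because the conversions are exact linear identities, a network trained under any one parameterization can in principle be read out under any other.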
10. Practical Applications and Impact
Stable Diffusion 3 (SD3)
Stability AI fully adopted the Flow Matching training paradigm in Stable Diffusion 3, with the following improvements:
- Rectified Flow training: Replaced traditional DDPM training with Flow Matching using OT paths
- DiT architecture (MM-DiT): Replaced U-Net with Transformers, jointly processing text and image tokens in a single sequence
- Improved sampling efficiency: Compared to SD 1.5/2.x, SD3 achieves better image quality with fewer sampling steps
FLUX
The FLUX model series from Black Forest Labs (founded by former core members of Stability AI) further advanced the Flow Matching + DiT paradigm:
- FLUX.1 [pro/dev/schnell]: Different configurations for different needs
- schnell version: Achieves 1-4 step generation through distillation
- Guidance distillation: Distills the multi-step classifier-free guidance process into fewer steps
Video Generation
Flow Matching has shown great potential in video generation:
- Spatiotemporal consistency: The ODE framework is naturally suited for modeling continuous temporal changes
- Controllable generation: The interpretability of velocity fields makes conditional control more intuitive
- Meta Movie Gen adopts Flow Matching, and other frontier video models are reported to build on similar continuous-flow frameworks
Other Application Areas
- Audio generation: Voicebox (Meta) uses Flow Matching for speech synthesis
- Protein design: Methods like FrameFlow apply Flow Matching to protein structure generation
- 3D generation: Applications of Flow Matching in point cloud generation, NeRF optimization, and other 3D tasks
- Scientific computing: Molecular dynamics simulation, weather prediction, and other domains
11. Code Intuition: A Minimal Flow Matching Implementation
The following pseudocode demonstrates the core implementation of Flow Matching, illustrating its simplicity:
```python
import torch

# ========== Training ==========
def train_step(model, x1, optimizer):
    """
    model: velocity field network v_θ(x, t)
    x1:    a batch of real data, shape [B, D]
    """
    # 1. Sample noise and time
    x0 = torch.randn_like(x1)              # standard Gaussian noise
    t = torch.rand(x1.shape[0], 1)         # uniform t ~ U[0, 1]

    # 2. Linear interpolation to get x_t
    xt = (1 - t) * x0 + t * x1

    # 3. Target velocity = endpoint - starting point
    target = x1 - x0

    # 4. Network predicts the velocity
    pred = model(xt, t)

    # 5. MSE loss
    loss = ((pred - target) ** 2).mean()

    # 6. Backpropagate and update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

# ========== Sampling ==========
@torch.no_grad()
def sample(model, shape, num_steps=50):
    """
    Sample from noise, generating data via Euler integration.
    """
    x = torch.randn(shape)                 # x_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0], 1), i / num_steps)
        v = model(x, t)                    # predict velocity
        x = x + v * dt                     # Euler step
    return x                               # approximately x_1 (generated data)
```
This is a simplified version
In practice, additional considerations include: time \(t\) encoding (sinusoidal embedding), conditional inputs (text embeddings, class labels), classifier-free guidance, EMA models, and more. But the core training loop is indeed this simple.
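Of those additions, the time encoding is the most mechanical. A sketch of one common choice, sinusoidal embeddings in the style of Transformer positional encodings (the `dim` and `max_period` values are illustrative, not prescribed by Flow Matching):

```python
import math
import torch

def time_embedding(t, dim=128, max_period=10000.0):
    """t: [B, 1] in [0, 1]  ->  embedding of shape [B, dim]."""
    half = dim // 2
    # geometrically spaced frequencies from 1 down to 1/max_period
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    args = t * freqs                       # [B, half] by broadcasting
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = time_embedding(torch.rand(8, 1))
```

The embedding would typically be passed through a small MLP and added to (or modulated into) the network's hidden activations so that `model(xt, t)` can condition on time.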
12. Reflections and Discussion
Will Flow Matching Replace Diffusion?
From a practical standpoint, Flow Matching is gradually becoming mainstream. Next-generation models like SD3 and FLUX have fully transitioned to Flow Matching. But a more accurate characterization is: Flow Matching is a natural evolution of Diffusion, rather than a complete replacement.
The relationship between the two is analogous to:
- Diffusion introduced the core paradigm of "iterative denoising"
- Flow Matching found a better way to train this paradigm
- The underlying generative philosophy is a continuous lineage
The Fundamental Difference Between ODE and SDE
- ODE (deterministic): Given an initial point, the trajectory is uniquely determined. The same noise always generates the same data
- SDE (stochastic): Trajectories contain randomness; the same noise may generate different data
The stochasticity of SDEs is sometimes useful (e.g., it can improve sample diversity), so some methods add a small amount of random noise (stochastic sampling) to the Flow Matching ODE framework, combining the strengths of both approaches.
Why Straighter Paths Are Better: Summary
- Sampling efficiency: Straight paths allow even coarse numerical integration to produce good results
- Learning efficiency: Constant velocity is easier to learn than velocity that changes dramatically over time
- Theoretical optimality: OT paths correspond to the optimal transport plan under the Wasserstein distance
- Practical value: The straighter the path, the easier it is to compress to very few sampling steps via distillation
Open Questions
- Choosing the optimal path: Is the OT straight-line path truly optimal in all scenarios? For complex multimodal distributions, might curved paths sometimes be better?
- Theoretical analysis of discretization error: What is the relationship between Flow Matching's error bound and the number of sampling steps under finite-step sampling?
- Relationship with Consistency Models: Consistency Models directly learn the mapping between any two points on an ODE trajectory — are they more efficient than Rectified Flow?
- OT in high-dimensional spaces: In very high-dimensional data spaces (e.g., megapixel images), do linear OT paths remain effective? How large is the gap between conditional OT and global OT?
References
- Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., & Le, M. (2022). Flow Matching for Generative Modeling. ICLR 2023.
- Liu, X., Gong, C., & Liu, Q. (2022). Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. ICLR 2023.
- Chen, R. T., Rubanova, Y., Bettencourt, J., & Duvenaud, D. (2018). Neural Ordinary Differential Equations. NeurIPS 2018.
- Albergo, M. S., & Vanden-Eijnden, E. (2022). Building Normalizing Flows with Stochastic Interpolants. ICLR 2023.
- Esser, P., Kulal, S., Blattmann, A., et al. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. ICML 2024. (Stable Diffusion 3 technical report)
- Tong, A., Malkin, N., Huguet, G., et al. (2023). Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport. TMLR 2024.