Flow Matching
Flow Matching is a family of training methods for generative models based on Continuous Normalizing Flows (CNFs). Introduced by Lipman et al. (2022) in the paper "Flow Matching for Generative Modeling", the core idea is to train a neural network via a remarkably simple regression objective to learn a continuous transformation (velocity field) from a noise distribution to a data distribution. Compared to diffusion models, Flow Matching is mathematically more intuitive, offers higher sampling efficiency, and has become the core training paradigm behind cutting-edge models such as Stable Diffusion 3 and FLUX.
Learning path: Normalizing Flows → Neural ODEs / CNF → Flow Matching (CFM) → Optimal Transport paths → Rectified Flow → Applications (SD3, FLUX)
1. Background and Motivation
The Evolution of Generative Models
Deep generative models have gone through several paradigm shifts:
| Period | Model | Core Idea | Main Limitations |
|---|---|---|---|
| 2013-2014 | VAE | Variational inference + encoder-decoder | Blurry generation quality |
| 2014-2018 | GAN | Adversarial training, generator vs. discriminator | Unstable training, mode collapse |
| 2020-2022 | Diffusion | Gradual noising → gradual denoising | Many sampling steps, complex math |
| 2022-present | Flow Matching | Directly learning velocity fields, ODE framework | Few so far (simpler theory, faster sampling) |
Problems with Diffusion Models
Despite the enormous success of diffusion models in image generation, they suffer from several pain points:
- Slow sampling: Standard DDPM requires 1000 sampling steps; even accelerated methods like DDIM typically need 20-50 steps
- Complex mathematical framework: Involves forward SDEs, reverse SDEs, score functions, noise schedules (\(\alpha_t, \bar{\alpha}_t, \sigma_t\)), and numerous other terms
- Curved paths: The diffusion process follows a "curved path" from data to noise, requiring more steps during reverse sampling to track accurately
- Many design choices: The choice of noise schedule (linear, cosine, sigmoid, etc.) significantly affects generation quality, and there is no unified theoretical guidance
The Core Idea of Flow Matching
Flow Matching answers a simple yet profound question:
Can we find the shortest, straightest path from noise to data, and then directly learn the velocity along that path?
The answer is yes. Flow Matching works as follows:
- Define a linear interpolation path from noise \(x_0\) to data \(x_1\)
- Compute the velocity (direction + magnitude) along this path
- Train a neural network to predict this velocity
The entire training objective reduces to a simple mean squared error regression problem — no noise schedule, no score matching, no SDE solver needed.
2. Prerequisite: Normalizing Flows
Before understanding Flow Matching, two prerequisite concepts are needed: traditional Normalizing Flows and Continuous Normalizing Flows (Neural ODEs).
Invertible Transformations and Density Transformation
The core idea of Normalizing Flows is to transform a simple distribution (e.g., a standard Gaussian) into a complex data distribution through a series of invertible transformations.
Let \(z \sim p_z(z)\) be a simple prior distribution (e.g., \(\mathcal{N}(0, I)\)), and \(f\) be an invertible transformation. Data \(x\) is generated as:
\[ x = f(z) \]
Change of Variables Formula
Since \(f\) is invertible, we can compute the exact probability density of the data via the Change of Variables formula:
\[ p_x(x) = p_z\left( f^{-1}(x) \right) \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right| \]
Taking the logarithm:
\[ \log p_x(x) = \log p_z\left( f^{-1}(x) \right) + \log \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right| \]
This means that as long as we can compute \(f^{-1}\) and the determinant of its Jacobian matrix, we can exactly compute the log-likelihood of the data and perform maximum likelihood training.
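The formula can be checked concretely in one dimension. The sketch below uses a hypothetical affine map \(f(z) = az + b\) (the values of `a` and `b` are illustrative, not from the text); since \(x = f(z)\) with \(z \sim \mathcal{N}(0,1)\) is exactly \(\mathcal{N}(b, a^2)\), the change-of-variables density can be compared against the known answer.

```python
import math
import torch

# Illustrative invertible map f(z) = a*z + b (a, b chosen arbitrarily)
a, b = 2.0, 1.0

def log_prob_x(x):
    z = (x - b) / a                                     # f^{-1}(x)
    log_pz = -0.5 * z**2 - 0.5 * math.log(2 * math.pi)  # log N(z; 0, 1)
    log_det = -math.log(abs(a))                         # log|det df^{-1}/dx|
    return log_pz + log_det

# x = f(z) is distributed as N(b, a^2); compare against the analytic density
lp = log_prob_x(torch.tensor(1.5))
analytic = torch.distributions.Normal(b, a).log_prob(torch.tensor(1.5))
```

The two log-densities agree, which is exactly the maximum-likelihood training signal a Normalizing Flow uses.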
Limitations of Traditional Normalizing Flows
To make the above formula tractable, traditional Normalizing Flows (e.g., RealNVP, Glow, NICE) must satisfy two constraints:
- The transformation must be invertible: Both \(f\) and \(f^{-1}\) must exist and be computable
- The Jacobian determinant must be tractable: Computing the determinant of a general square matrix is \(O(n^3)\), so special architectural designs are required
Architectural constraints
To simultaneously satisfy invertibility and tractable determinant computation, traditional Normalizing Flows can only use special structures (e.g., affine coupling layers, autoregressive transforms), which severely limits the model's expressive power. In contrast, diffusion models and Flow Matching can use arbitrary neural network architectures (e.g., U-Net, DiT).
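To make the constraint concrete, here is a minimal affine coupling layer in the spirit of RealNVP (a sketch, not the exact published architecture; the single `nn.Linear` conditioner is a simplification). Half the input passes through unchanged and parameterizes an affine transform of the other half, so both the inverse and the log-determinant are cheap:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # outputs log-scale s and shift t, each of size dim // 2
        self.net = nn.Linear(dim // 2, dim)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(s) + t         # invertible in x2 given x1
        log_det = s.sum(dim=-1)            # Jacobian is block-triangular
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        s, t = self.net(y1).chunk(2, dim=-1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=-1)

layer = AffineCoupling(dim=8)
x = torch.randn(4, 8)
y, log_det = layer(x)
x_rec = layer.inverse(y)
```

The design choice is visible in the code: invertibility and a tractable determinant come for free, but only because half the dimensions are frozen per layer, which is precisely the expressiveness restriction the text describes.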
3. Prerequisite: Continuous Normalizing Flows (Neural ODEs)
From Discrete to Continuous
Traditional Normalizing Flows achieve distribution transformation through compositions of a finite number of invertible transformations. A natural question is: what happens if the number of steps goes to infinity and each step's change becomes infinitesimal?
The answer is: the discrete sequence of transformations becomes an Ordinary Differential Equation (ODE).
Definition of Neural ODEs
Continuous Normalizing Flows (CNFs), also known as Neural ODEs (Chen et al., 2018), define a continuous transformation of data using a vector field (velocity field) \(v_\theta(x, t)\):
\[ \frac{dx(t)}{dt} = v_\theta(x(t), t), \quad t \in [0, 1] \]
where:
- \(x(0) = x_0 \sim p_0\): the starting distribution (typically standard Gaussian noise)
- \(x(1) = x_1 \sim p_1\): the target distribution (data distribution)
- \(v_\theta(x, t)\): the neural network-parameterized velocity field, indicating in which direction and at what speed a particle at position \(x\) and time \(t\) should move
Intuitive understanding: imagine countless particles in space, starting from positions drawn from the noise distribution. The velocity field \(v_\theta\) acts like a "wind field," telling each particle where to go at every position and time. As all particles flow from \(t=0\) to \(t=1\), they transform from the noise distribution to the data distribution.
t=0 (noise)                                t=1 (data)
 .  .                                        .***.
. .. .     v_θ(x,t) guides the particles    *     *
 . .    ──────────────────────────→        *  *  *
. . .           ODE integration             *     *
 .  .                                        .***.
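The picture above can be made concrete with a toy integrator. Instead of a trained network, the sketch below uses the known linear field \(v(x,t) = -x\), whose ODE has the closed-form solution \(x(t) = x(0)\,e^{-t}\), so the Euler result can be checked:

```python
import math
import torch

def euler_integrate(v, x0, num_steps=1000):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = x0.clone(), 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * v(x, k / num_steps)   # one Euler step along the flow
    return x

# Toy velocity field v(x, t) = -x: contracts every particle toward the origin
x0 = torch.tensor([2.0, -1.0])
x1 = euler_integrate(lambda x, t: -x, x0)
expected = x0 * math.exp(-1.0)             # closed-form x(1)
```

In a real CNF the lambda would be the neural network \(v_\theta\); everything else about "flowing particles from \(t=0\) to \(t=1\)" is the same.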
The Continuity Equation
In CNFs, the evolution of the probability density \(p_t(x)\) over time satisfies the Continuity Equation:
\[ \frac{\partial p_t(x)}{\partial t} + \nabla \cdot \left( p_t(x) \, v_t(x) \right) = 0 \]
This reflects conservation of mass from fluid mechanics: the "total amount" of probability is preserved during the transformation (always integrating to 1), and changes in probability density are entirely determined by the divergence of the velocity field.
Expanding the divergence term:
\[ \frac{\partial p_t(x)}{\partial t} = -\nabla p_t(x) \cdot v_t(x) - p_t(x) \, \nabla \cdot v_t(x) \]
Physical intuition behind the continuity equation
Imagine probability density as a "fluid." \(v_t(x)\) is the flow velocity, and \(\nabla \cdot v_t(x)\) is the divergence. If the velocity field in some region is "divergent" (divergence > 0), fluid flows out of that region and the density decreases; if it is "convergent" (divergence < 0), fluid flows into the region and the density increases.
Evolution of Log-Density
For CNFs, the change in log-density along a flow line can be expressed via the instantaneous change of variables formula:
\[ \frac{d \log p(x(t))}{dt} = -\text{tr}\left( \frac{\partial v_\theta}{\partial x}(x(t), t) \right) \]
Here \(\text{tr}\left(\frac{\partial v_\theta}{\partial x}\right)\) is the trace of the velocity field's Jacobian, which replaces the full Jacobian determinant computation required in traditional Normalizing Flows.
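Even the trace can be estimated without materializing the Jacobian, via Hutchinson's estimator \(\text{tr}(A) = \mathbb{E}_\varepsilon[\varepsilon^\top A \varepsilon]\) with \(\varepsilon \sim \mathcal{N}(0, I)\) — the trick used by FFJORD-style CNF training. A sketch with a toy linear velocity field \(x \mapsto Ax\) whose trace is known:

```python
import torch

def hutchinson_trace(v_fn, x, num_samples=2000):
    """Monte Carlo estimate of tr(dv/dx) at a point x."""
    x = x.clone().requires_grad_(True)
    out = v_fn(x)
    est = 0.0
    for _ in range(num_samples):
        eps = torch.randn_like(x)
        # vector-Jacobian product J^T @ eps via autograd (no full Jacobian)
        (g,) = torch.autograd.grad(out, x, eps, retain_graph=True)
        est += (g * eps).sum()
    return est / num_samples

torch.manual_seed(0)
A = torch.tensor([[1.0, 0.3], [0.0, 2.0]])   # true trace = 3
trace_est = hutchinson_trace(lambda x: A @ x, torch.randn(2))
```

The estimate is unbiased but noisy; in practice one noise sample per training step is typical, amortizing the variance over the batch.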
Training Difficulties of CNFs
Although CNFs are theoretically elegant, direct training suffers from severe computational issues:
- Forward pass requires ODE solving: To obtain \(x(1)\), one must numerically integrate the entire ODE from \(x(0)\)
- Backpropagation requires passing gradients through the ODE solver: Using the adjoint method to compute gradients requires solving another ODE
- Extremely high training cost: Each training step involves multiple ODE solves, and step sizes need adaptive adjustment
These issues make traditional CNFs practically infeasible on large-scale datasets. Flow Matching was proposed precisely to address this training difficulty.
4. Core Method of Flow Matching
Objective: Learning the Marginal Velocity Field
Our ultimate goal is to learn a velocity field \(v_\theta(x, t)\) such that the ODE it defines transforms the noise distribution \(p_0\) into the data distribution \(p_1\).
Ideally, we want to minimize the Flow Matching (FM) objective:
\[ \mathcal{L}_{FM}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1], \; x \sim p_t(x)} \left\| v_\theta(x, t) - u_t(x) \right\|^2 \]
where \(u_t(x)\) is the true velocity field that generates the probability path \(p_t(x)\).
Problem: Both \(u_t(x)\) and \(p_t(x)\) are unknown. We do not know what the "true" velocity field that transforms noise into data looks like.
Key Insight: Conditional Flow Matching
The core contribution of Lipman et al. is proving that we do not need to know the marginal velocity field \(u_t(x)\); it suffices to regress against the conditional velocity field \(u_t(x | x_1)\).
Define the conditional probability path \(p_t(x | x_1)\): given a data point \(x_1\), this is the probability path from noise to that specific data point.
Define the conditional velocity field \(u_t(x | x_1)\): the velocity field that generates the conditional probability path \(p_t(x | x_1)\).
Conditional Flow Matching (CFM) objective:
\[ \mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1], \; x_1 \sim q(x_1), \; x \sim p_t(x | x_1)} \left\| v_\theta(x, t) - u_t(x | x_1) \right\|^2 \]
Equivalence of CFM and FM
Lipman et al. proved a key theorem: \(\mathcal{L}_{CFM}\) and \(\mathcal{L}_{FM}\) are equivalent in terms of gradients, i.e., \(\nabla_\theta \mathcal{L}_{CFM} = \nabla_\theta \mathcal{L}_{FM}\). This means the velocity field obtained by minimizing the CFM objective is identical to what we would get from directly minimizing the FM objective. This result is liberating: we only need to design simple conditional paths and conditional velocity fields.
Optimal Transport Path: The Simplest Choice
There are infinitely many choices for the conditional probability path and conditional velocity field. The simplest and most natural choice is the Optimal Transport (OT) path, also known as the linear interpolation path.
Conditional probability path (Gaussian form):
\[ p_t(x | x_1) = \mathcal{N}\left( x; \; t x_1, \; \left( 1 - (1 - \sigma_{\min}) t \right)^2 I \right) \]
where \(\sigma_{\min}\) is a small positive number (close to 0) controlling the variance at the endpoint.
When \(\sigma_{\min} \to 0\), this simplifies to:
\[ p_t(x | x_1) = \mathcal{N}\left( x; \; t x_1, \; (1-t)^2 I \right) \]
- \(t = 0\): \(p_0(x | x_1) = \mathcal{N}(x; 0, I)\), i.e., standard Gaussian noise
- \(t = 1\): \(p_1(x | x_1) = \mathcal{N}(x; x_1, \sigma_{\min}^2 I) \approx \delta(x - x_1)\), nearly concentrated at the data point \(x_1\)
Conditional velocity field (OT path):
\[ u_t(x | x_1) = \frac{x_1 - (1 - \sigma_{\min}) x}{1 - (1 - \sigma_{\min}) t} \]
The Minimal Form: Linear Interpolation
When \(\sigma_{\min} \to 0\), everything becomes extremely simple.
Interpolation formula: Given noise \(x_0 \sim \mathcal{N}(0, I)\) and data \(x_1 \sim q(x_1)\), the interpolation at time \(t\) is:
\[ x_t = (1 - t) x_0 + t x_1 \]
t=0        t=0.25       t=0.5        t=0.75       t=1
noise x₀ ─────→──────→──────→──────→ data x₁
x_t = (1-t)x₀ + tx₁  (linear interpolation)
Velocity: Differentiating \(x_t\) with respect to \(t\), the velocity is a constant:
\[ u_t = \frac{d x_t}{dt} = x_1 - x_0 \]
This is the target value of the conditional velocity field. The velocity is simply the endpoint minus the starting point — moving in a straight line at constant speed.
Training objective:
\[ \mathcal{L}(\theta) = \mathbb{E}_{t, \, x_0, \, x_1} \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2 \]
Connection to noise prediction in diffusion models
In diffusion models (DDPM), the training objective is to predict the added noise \(\epsilon\): \(\| \epsilon_\theta(x_t, t) - \epsilon \|^2\). In Flow Matching, the training objective is to predict the velocity \(v = x_1 - x_0\). Both are simple regression objectives, but Flow Matching's target is more intuitive: the velocity is "data minus noise," pointing from noise toward data.
5. Training and Sampling Algorithms
Training Algorithm
The training procedure for Flow Matching is extremely simple:
Algorithm: Flow Matching Training
──────────────────────────────────
Input: Dataset D, network v_θ
Repeat:
    1. Sample data x₁ ~ q(x₁)
    2. Sample standard Gaussian noise x₀ ~ N(0, I)
    3. Sample time uniformly t ~ U[0, 1]
    4. Compute interpolation: x_t = (1-t) x₀ + t x₁
    5. Compute target velocity: u = x₁ - x₀
    6. Compute loss: L = || v_θ(x_t, t) - u ||²
    7. Backpropagate and update θ
Until convergence
Note that the training process requires no ODE solver at all. Compared to traditional CNFs, this is a qualitative leap:
- Traditional CNF: Each training step requires solving an ODE (forward + backward), incurring very high computational cost
- Flow Matching: Each training step requires only one forward pass + one backward pass, no different from training an ordinary regression network
Sampling Algorithm
After training, sampling requires integrating the ODE \(\frac{dx}{dt} = v_\theta(x, t)\) from \(t=0\) to \(t=1\).
Euler method sampling (simplest):
Algorithm: Flow Matching Sampling (Euler Method)
──────────────────────────────────
Input: Trained v_θ, number of steps N
1. Sample x₀ ~ N(0, I)
2. Set step size Δt = 1/N
3. for k = 0, 1, ..., N-1:
       t_k = k / N
       x_{k+1} = x_k + Δt · v_θ(x_k, t_k)
4. Return x_N as the generated sample
Higher-order ODE solvers (more accurate):
Higher-order solvers such as RK45 (fourth-fifth order Runge-Kutta) or the midpoint method can also be used. Higher-order methods are more accurate at the same number of steps, but each step requires multiple network function evaluations (NFEs).
Choosing the number of steps
Since Flow Matching typically learns relatively straight paths (especially with OT paths), the Euler method can achieve good results with relatively few steps (e.g., 20-50). This is much fewer than the typical sampling steps for diffusion models. With further optimization via Rectified Flow, the number can be reduced to as few as 1-4 steps.
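As a sketch of a second-order alternative to Euler, here is the midpoint method. The callable `v` stands in for the trained velocity field \(v_\theta\); a toy field with a known solution (\(v(x,t) = -x\), so \(x(1) = x_0 e^{-1}\)) is used so the result can be checked:

```python
import math
import torch

def sample_midpoint(v, x0, num_steps=25):
    """Midpoint (2nd-order) ODE sampling: 2 NFEs per step, O(dt^2) accuracy."""
    x, dt = x0.clone(), 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x_mid = x + 0.5 * dt * v(x, t)        # half Euler step
        x = x + dt * v(x_mid, t + 0.5 * dt)   # full step with midpoint slope
    return x

# Toy field v(x, t) = -x  =>  x(1) = x0 * exp(-1)
x0 = torch.tensor([1.0, -2.0])
x1 = sample_midpoint(lambda x, t: -x, x0)
expected = x0 * math.exp(-1.0)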
6. Detailed Comparison: Flow Matching vs. Diffusion
Framework Comparison
| Aspect | Diffusion (DDPM/Score-based) | Flow Matching |
|---|---|---|
| Stochastic process | SDE (Stochastic Differential Equation) | ODE (Ordinary Differential Equation) |
| Training objective | Predict noise \(\epsilon\) or score \(\nabla_x \log p_t(x)\) | Predict velocity \(v = x_1 - x_0\) |
| Path shape | Curved path (determined by noise schedule) | Straight path (OT path) |
| Sampling process | Reverse SDE/ODE solving | Forward ODE solving |
| Design choices | Noise schedule, \(\beta_t\), parameterization, etc. | Almost no hyperparameters |
| Mathematical complexity | High (SDE, Fokker-Planck, score matching) | Low (ODE, linear interpolation, MSE regression) |
| Sampling steps | Typically 20-1000 steps | Typically 10-50 steps |
Intuition Behind Path Comparison
Diffusion path (curved): Flow Matching path (straight):
data x₁ . data x₁ .
/ \ |
/ \ |
| | curved path | straight path
| | needs more steps | fewer steps
\ / |
\ / |
noise x₀ . noise x₀ .
Curved paths lead to larger discretization errors and require more sampling steps to maintain accuracy. Straight paths minimize discretization error — even coarse Euler steps can yield good results.
Mathematical Equivalence
It is worth noting that Diffusion and Flow Matching are mathematically equivalent under certain conditions. Specifically:
- DDPM's deterministic sampler (DDIM) is actually solving a probability flow ODE
- If we view Diffusion's noise schedule as a particular choice of probability path, it can be subsumed under the general Flow Matching framework
Therefore, Flow Matching can be seen as a more general framework than Diffusion, rather than merely an alternative. Diffusion's SDE path is one specific case among the infinitely many path choices in Flow Matching — but not the optimal one.
Why Straight Paths Are Better
The advantages of OT straight-line paths can be understood from several perspectives:
- Minimal discretization error: Straight paths have zero curvature; Euler integration on a straight line is exact (if the velocity field is perfectly learned)
- Easier to learn: The velocity along a straight path is the constant \(x_1 - x_0\), which does not change over time. The network only needs to learn "how to infer the target direction given the current position and time"
- Optimal transport interpretation: Straight paths correspond to the optimal transport plan under the quadratic Wasserstein distance — the most "economical" way to transform one distribution into another
7. Deeper Understanding: Why CFM Is Equivalent to FM
This section provides a more detailed explanation of the CFM equivalence theorem.
Construction of the Marginal Velocity Field
Given the conditional velocity field \(u_t(x | x_1)\) and the conditional probability path \(p_t(x | x_1)\), the marginal velocity field can be constructed by taking an expectation over all data points:
\[ u_t(x) = \int u_t(x | x_1) \, \frac{p_t(x | x_1) \, q(x_1)}{p_t(x)} \, dx_1 \]
where the marginal probability density is:
\[ p_t(x) = \int p_t(x | x_1) \, q(x_1) \, dx_1 \]
Intuitively, the marginal velocity field is a probability-weighted average of all conditional velocity fields. At position \(x\) and time \(t\), the contribution of each data point \(x_1\) to the velocity field is weighted by \(p_t(x | x_1) q(x_1)\).
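This weighted average can be computed by hand for a toy 1-D dataset with two equally likely points \(x_1 \in \{-1, +1\}\), using the OT conditional path \(p_t(x|x_1) = \mathcal{N}(t x_1, (1-t)^2)\) and velocity \(u_t(x|x_1) = (x_1 - x)/(1-t)\):

```python
import torch

def marginal_velocity(x, t, data=(-1.0, 1.0)):
    """p_t(x|x1)-weighted average of conditional velocities, uniform q(x1)."""
    num, den = 0.0, 0.0
    for x1 in data:
        w = torch.exp(torch.distributions.Normal(t * x1, 1.0 - t).log_prob(x))
        num = num + w * (x1 - x) / (1.0 - t)   # weighted conditional velocity
        den = den + w
    return num / den

# At x = 0, t = 0.5 the conditional velocities are +2 and -2 with equal
# weights, so the marginal velocity cancels to zero by symmetry.
u_mid = marginal_velocity(torch.tensor(0.0), t=0.5)
```

This is exactly the averaging behavior described in the next section on Rectified Flow: where conditional paths pull in opposite directions, the marginal velocity is their compromise.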
Proof Sketch of Gradient Equivalence
The key steps in proving \(\nabla_\theta \mathcal{L}_{CFM} = \nabla_\theta \mathcal{L}_{FM}\) are as follows:
- Expand \(\mathcal{L}_{FM}\) as \(\| v_\theta - u_t \|^2\), where \(u_t\) is the marginal velocity field
- Note that the cross-term relevant to \(\theta\) in the expansion is \(-2 \, \mathbb{E}[v_\theta(x,t) \cdot u_t(x)]\)
- Substitute the definition of the marginal velocity field and swap the order of integration using \(p_t(x) = \int p_t(x|x_1)q(x_1)dx_1\)
- Arrive at the same gradient expression as \(\mathcal{L}_{CFM}\)
Practical significance
This equivalence means that during training, we only need to construct simple conditional paths and conditional velocities (linear interpolation + constant velocity) for each data point \(x_1\), and the network will naturally learn the correct marginal velocity field. The conditional paths of different data points may cross (since straight lines from different noise samples to different data points can intersect), but after gradient averaging, the marginal velocity field learned by the network will correctly handle these crossings.
8. Rectified Flow: Further Straightening the Paths
Motivation
Although Flow Matching uses straight conditional paths, the marginal velocity field is a weighted average of all conditional velocity fields, so the actual marginal flow lines may still be curved.
Consider a simple example: if the data distribution has two modes (e.g., two Gaussians), the straight paths from noise to different modes will cross. In the crossing region, the velocity field is an average of two different directions, causing the actual flow lines to curve around.
The Reflow Operation
Rectified Flow (Liu et al., 2022) proposes an iterative method for straightening flow lines, called Reflow:
- Step 1: Train a velocity field \(v_\theta^{(1)}\) using standard Flow Matching
- Step 2: Generate paired data using \(v_\theta^{(1)}\). Starting from \(x_0 \sim p_0\), integrate the ODE to obtain \(\hat{x}_1 = \text{ODESolve}(x_0, v_\theta^{(1)})\)
- Step 3: Retrain Flow Matching with the new pairs \((x_0, \hat{x}_1)\) to obtain \(v_\theta^{(2)}\)
- Repeat: Continue iterating the reflow process
Round 1 training: random pairing (x₀, x₁)
Paths cross, marginal flow lines are curved
× ← path crossing
/ \
x₀ₐ ────/──────→ x₁ₐ
x₀ᵦ ──/────────→ x₁ᵦ
After reflow: pairs become (x₀, ODESolve(x₀))
Fewer crossings, straighter flow lines
x₀ₐ ──────────→ x₁ₐ
x₀ᵦ ──────────→ x₁ᵦ
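The pair-generation half of one Reflow round can be sketched as follows. Here `model` stands for the trained velocity field \(v_\theta^{(1)}\) (any callable `model(x, t)`); the returned \((x_0, \hat{x}_1)\) pairs would then be fed back into ordinary Flow Matching training, which is not shown:

```python
import torch

@torch.no_grad()
def make_reflow_pairs(model, num_pairs, dim, num_steps=100):
    """Generate (x0, x1_hat) pairs with x1_hat = ODESolve(x0, model)."""
    x0 = torch.randn(num_pairs, dim)
    x, dt = x0.clone(), 1.0 / num_steps
    for k in range(num_steps):               # fixed-step Euler ODE solve
        t = torch.full((num_pairs, 1), k * dt)
        x = x + dt * model(x, t)
    return x0, x

# Demo with a toy "model" whose ODE solution is known: v(x, t) = -x,
# so x1_hat ≈ x0 * exp(-1)
x0, x1_hat = make_reflow_pairs(lambda x, t: -x, num_pairs=4, dim=2)
```

The key property is that each \(\hat{x}_1\) is now deterministically coupled to its own \(x_0\), so retraining on these pairs cannot produce crossing straight-line paths between independently sampled endpoints.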
The Ultimate Goal: Straight-Line Paths
After reflow iterations, the flow lines become increasingly straight. When the flow lines are perfectly straight, a single Euler step (with \(\Delta t = 1\)) yields exact sampling:
\[ x_1 = x_0 + v_\theta(x_0, 0) \]
This means sampling degenerates to a single forward pass, comparable to the sampling speed of GANs.
Reflow and distillation
Reflow can be combined with knowledge distillation. First straighten the paths via reflow, then compress multi-step sampling to 1-2 steps through progressive distillation. InstaFlow (2023) is a successful example of this approach, achieving high-quality single-step image generation.
9. Connection to Score Matching
Flow Matching and Score-based Diffusion Models (score matching) share a close mathematical relationship.
Relationship Between Score and Velocity
In the Diffusion framework, the score function is defined as:
\[ s(x, t) = \nabla_x \log p_t(x) \]
In the Flow Matching framework, for a Gaussian conditional path \(p_t(x | x_1) = \mathcal{N}(x; \mu_t(x_1), \sigma_t^2 I)\), the conditional velocity field and the conditional score are related by:
\[ u_t(x | x_1) = \frac{\dot{\sigma}_t}{\sigma_t} \left( x - \mu_t(x_1) \right) + \dot{\mu}_t(x_1), \qquad \nabla_x \log p_t(x | x_1) = -\frac{x - \mu_t(x_1)}{\sigma_t^2} \]
so that \(u_t(x | x_1) = -\sigma_t \dot{\sigma}_t \, \nabla_x \log p_t(x | x_1) + \dot{\mu}_t(x_1)\), where \(\mu_t\) and \(\sigma_t\) are the time-dependent mean and standard deviation of the conditional distribution, respectively.
For the OT path (\(\mu_t = t x_1\), \(\sigma_t = 1-t\)):
\[ u_t(x | x_1) = \frac{-1}{1-t} \left( x - t x_1 \right) + x_1 = \frac{x_1 - x}{1 - t} \]
Unifying Prediction Targets
Different parameterizations actually predict different but equivalent quantities:
| Parameterization | Prediction target | Relationship |
|---|---|---|
| \(\epsilon\)-prediction | Noise \(\epsilon = x_0\) | DDPM standard |
| \(v\)-prediction | Velocity \(v = x_1 - x_0\) | Flow Matching standard |
| \(x\)-prediction | Data \(x_1\) | Directly predicting the denoised result |
| score prediction | \(\nabla_x \log p_t\) | Score-based standard |
Under the OT path \(x_t = (1-t) x_0 + t x_1\), these prediction targets can be converted to one another. With \(v = x_1 - x_0\):
\[ \epsilon = x_0 = x_t - t\,v, \qquad x_1 = x_t + (1-t)\,v, \qquad \nabla_x \log p_t(x_t | x_1) = -\frac{x_t - t x_1}{(1-t)^2} = -\frac{\epsilon}{1-t} \]
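The conversions are simple enough to verify numerically. The sketch below builds an OT-path sample and recovers \(\epsilon\), \(x_1\), and the conditional score from \((x_t, t, v)\) alone:

```python
import torch

# Build one OT-path sample: x_t = (1-t)*x0 + t*x1, velocity v = x1 - x0
torch.manual_seed(0)
x0, x1 = torch.randn(4), torch.randn(4)
t = 0.3
xt = (1 - t) * x0 + t * x1
v = x1 - x0

eps_from_v = xt - t * v                 # recovers the noise epsilon = x0
x1_from_v = xt + (1 - t) * v            # recovers the data x1
score_from_v = -eps_from_v / (1 - t)    # conditional score -(x_t - t*x1)/(1-t)^2
```

Because the conversions are exact linear identities, a network trained under any one parameterization can in principle be read out under any other.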
10. Practical Applications and Impact
Stable Diffusion 3 (SD3)
Stability AI fully adopted the Flow Matching training paradigm in Stable Diffusion 3, with the following improvements:
- Rectified Flow training: Replaced traditional DDPM training with Flow Matching using OT paths
- DiT architecture (MM-DiT): Replaced U-Net with Transformers, jointly processing text and image tokens in a single sequence
- Improved sampling efficiency: Compared to SD 1.5/2.x, SD3 achieves better image quality with fewer sampling steps
FLUX
The FLUX model series from Black Forest Labs (founded by former core members of Stability AI) further advanced the Flow Matching + DiT paradigm:
- FLUX.1 [pro/dev/schnell]: Different configurations for different needs
- schnell version: Achieves 1-4 step generation through distillation
- Guidance distillation: Distills the multi-step classifier-free guidance process into fewer steps
Video Generation
Flow Matching has shown great potential in video generation:
- Spatiotemporal consistency: The ODE framework is naturally suited for modeling continuous temporal changes
- Controllable generation: The interpretability of velocity fields makes conditional control more intuitive
- Meta Movie Gen adopts Flow Matching, and other frontier video models are reported to build on similar continuous-flow frameworks
Other Application Areas
- Audio generation: Voicebox (Meta) uses Flow Matching for speech synthesis
- Protein design: Methods like FrameFlow apply Flow Matching to protein structure generation
- 3D generation: Applications of Flow Matching in point cloud generation, NeRF optimization, and other 3D tasks
- Scientific computing: Molecular dynamics simulation, weather prediction, and other domains
11. Code Intuition: A Minimal Flow Matching Implementation
The following pseudocode demonstrates the core implementation of Flow Matching, illustrating its simplicity:
```python
import torch

# ========== Training ==========
def train_step(model, x1, optimizer):
    """
    model: velocity field network v_θ(x, t)
    x1:    a batch of real data, shape [B, D]
    """
    # 1. Sample noise and time
    x0 = torch.randn_like(x1)              # standard Gaussian noise
    t = torch.rand(x1.shape[0], 1)         # uniform t ~ U[0, 1]

    # 2. Linear interpolation to get x_t
    xt = (1 - t) * x0 + t * x1

    # 3. Target velocity = endpoint - starting point
    target = x1 - x0

    # 4. Network predicts the velocity
    pred = model(xt, t)

    # 5. MSE loss
    loss = ((pred - target) ** 2).mean()

    # 6. Backpropagate and update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

# ========== Sampling ==========
@torch.no_grad()
def sample(model, shape, num_steps=50):
    """
    Sample from noise, generating data via Euler integration.
    """
    x = torch.randn(shape)                 # x_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0], 1), i / num_steps)
        v = model(x, t)                    # predict velocity
        x = x + v * dt                     # Euler step
    return x                               # approximately x_1 (generated data)
```
This is a simplified version
In practice, additional considerations include: time \(t\) encoding (sinusoidal embedding), conditional inputs (text embeddings, class labels), classifier-free guidance, EMA models, and more. But the core training loop is indeed this simple.
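Of those additions, the time encoding is the most mechanical. A sketch of one common choice, sinusoidal embeddings in the style of Transformer positional encodings (the `dim` and `max_period` values are illustrative, not prescribed by Flow Matching):

```python
import math
import torch

def time_embedding(t, dim=128, max_period=10000.0):
    """t: [B, 1] in [0, 1]  ->  embedding of shape [B, dim]."""
    half = dim // 2
    # geometrically spaced frequencies from 1 down to 1/max_period
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    args = t * freqs                       # [B, half] by broadcasting
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = time_embedding(torch.rand(8, 1))
```

The embedding would typically be passed through a small MLP and added to (or modulated into) the network's hidden activations so that `model(xt, t)` can condition on time.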
12. Reflections and Discussion
Will Flow Matching Replace Diffusion?
From a practical standpoint, Flow Matching is gradually becoming mainstream. Next-generation models like SD3 and FLUX have fully transitioned to Flow Matching. But a more accurate characterization is: Flow Matching is a natural evolution of Diffusion, rather than a complete replacement.
The relationship between the two is analogous to:
- Diffusion introduced the core paradigm of "iterative denoising"
- Flow Matching found a better way to train this paradigm
- The underlying generative philosophy is a continuous lineage
The Fundamental Difference Between ODE and SDE
- ODE (deterministic): Given an initial point, the trajectory is uniquely determined. The same noise always generates the same data
- SDE (stochastic): Trajectories contain randomness; the same noise may generate different data
The stochasticity of SDEs is sometimes useful (e.g., it can improve sample diversity), so some methods add a small amount of random noise (stochastic sampling) to the Flow Matching ODE framework, combining the strengths of both approaches.
Why Straighter Paths Are Better: Summary
- Sampling efficiency: Straight paths allow even coarse numerical integration to produce good results
- Learning efficiency: Constant velocity is easier to learn than velocity that changes dramatically over time
- Theoretical optimality: OT paths correspond to the optimal transport plan under the Wasserstein distance
- Practical value: The straighter the path, the easier it is to compress to very few sampling steps via distillation
Open Questions
- Choosing the optimal path: Is the OT straight-line path truly optimal in all scenarios? For complex multimodal distributions, might curved paths sometimes be better?
- Theoretical analysis of discretization error: What is the relationship between Flow Matching's error bound and the number of sampling steps under finite-step sampling?
- Relationship with Consistency Models: Consistency Models directly learn the mapping between any two points on an ODE trajectory — are they more efficient than Rectified Flow?
- OT in high-dimensional spaces: In very high-dimensional data spaces (e.g., megapixel images), do linear OT paths remain effective? How large is the gap between conditional OT and global OT?
References
- Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., & Le, M. (2022). Flow Matching for Generative Modeling. ICLR 2023.
- Liu, X., Gong, C., & Liu, Q. (2022). Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. ICLR 2023.
- Chen, R. T., Rubanova, Y., Bettencourt, J., & Duvenaud, D. (2018). Neural Ordinary Differential Equations. NeurIPS 2018.
- Albergo, M. S., & Vanden-Eijnden, E. (2022). Building Normalizing Flows with Stochastic Interpolants. ICLR 2023.
- Esser, P., Kulal, S., Blattmann, A., et al. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. ICML 2024. (Stable Diffusion 3 technical report)
- Tong, A., Malkin, N., Huguet, G., et al. (2023). Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport. TMLR 2024.