
DiT (Diffusion Transformer)

Paper: Scalable Diffusion Models with Transformers (Peebles & Xie, 2023)

DiT proposes replacing U-Net with a Transformer as the denoising backbone of diffusion models, demonstrating that Transformers exhibit excellent scaling properties for image generation tasks. This work has profoundly influenced the architectural choices of subsequent models such as Sora, SD3, and FLUX.


1. Background and Motivation

1.1 The Success of Diffusion Models

Since DDPM (Ho et al., 2020), diffusion models have achieved tremendous success in image generation. The core idea is:

  1. Forward process: Gradually add Gaussian noise to an image until it becomes pure noise
  2. Reverse process: Train a neural network to learn step-by-step denoising, recovering the image from pure noise

Mathematically, the forward process is defined as:

\[q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t \mathbf{I})\]

The reverse process is parameterized by a neural network \(\epsilon_\theta(x_t, t)\) that predicts the added noise.

Stable Diffusion further moves the diffusion process into the latent space by first encoding images into low-dimensional latents with a VAE, then performing diffusion on the latents, which drastically reduces computational cost.
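
As a concrete illustration, the forward corruption can be sampled directly from \(x_0\) in closed form. Below is a minimal PyTorch sketch, assuming a standard linear \(\beta\) schedule; it is illustrative only and not tied to any particular implementation:

```python
import torch

# Linear beta schedule with 1000 steps (an assumption, matching the common DDPM setup)
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t

def add_noise(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    abar_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * eps
    return x_t, eps   # the network is trained to predict eps from (x_t, t)
```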

1.2 Limitations of U-Net

Before DiT, virtually all mainstream diffusion models (DDPM, Stable Diffusion, DALL-E 2, Imagen, etc.) used U-Net as the denoising network. U-Net has the following limitations:

  • Highly architecture-specific: U-Net was originally designed for pixel-level prediction (first used in medical image segmentation). Its encoder-decoder + skip connection structure is not a general-purpose design
  • Difficult to scale: U-Net's scaling behavior is unclear, lacking the well-defined "bigger model = better performance" scaling law that Transformers exhibit
  • Lack of unification: NLP has already converged on the Transformer architecture, while the vision/generation domain still relies on various task-specific architectures

1.3 The Scaling Advantage of Transformers

ViT (Dosovitskiy et al., 2020) demonstrated that a pure Transformer architecture can match or even surpass CNNs on vision tasks, given sufficient data and compute. The core advantages of Transformers are:

  • Clear scaling behavior: Increasing model parameters, data, and compute leads to predictable performance gains
  • Architectural universality: The same architecture can handle text, images, audio, video, and other modalities
  • Mature engineering ecosystem: A wealth of optimization techniques targeting Transformers (FlashAttention, model parallelism, etc.)

1.4 The Central Question of DiT

The DiT paper poses a straightforward question:

Central Question

Can Transformers replace U-Net as the denoising backbone of diffusion models? If so, how do they scale?

The answer is affirmative. DiT demonstrates that Transformer-based diffusion models are not only feasible but exhibit excellent scaling properties -- larger models + more compute = better generation quality (lower FID).


2. DiT Architecture in Detail

2.1 Overall Architecture

The overall pipeline of DiT can be summarized with the following architecture diagram:

                      DiT Overall Architecture
================================================================

  Input                                             Output
  -----                                             ------
  Noisy Latent z_t       Conditioning               Predicted
  (32 x 32 x 4)         (timestep t, class y)       Noise / x_0
       |                      |                      ^
       v                      v                      |
  +---------+          +-----------+           +------------+
  | Patchify|          | Embedding |           | Unpatchify |
  | + Pos   |          | (t_emb +  |           | (Linear +  |
  | Embed   |          |  y_emb)   |           |  Reshape)  |
  +---------+          +-----------+           +------------+
       |                      |                      ^
       v                      v                      |
       +----------+-----------+                      |
                  |                                   |
                  v                                   |
       +--------------------+                         |
       |   DiT Block #1     |                         |
       +--------------------+                         |
                  |                                   |
                  v                                   |
       +--------------------+                         |
       |   DiT Block #2     |                         |
                 ...                                  |
       +--------------------+                         |
                  |                                   |
                  v                                   |
       +--------------------+                         |
       |   DiT Block #N     |-------------------------+
       +--------------------+

================================================================

2.2 Patchify: Splitting the Latent into Patches

Following the same approach ViT uses for images, DiT first divides the input noisy latent \(z_t\) into non-overlapping patches:

  • Input latent dimensions: \(H \times W \times C\) (for the ImageNet 256x256 experiments, this is \(32 \times 32 \times 4\))
  • Patch size \(p\): the paper experiments with \(p = 2, 4, 8\)
  • Number of patches: \(T = \frac{H \times W}{p^2}\)
  • Each patch is linearly projected into a \(d\)-dimensional vector

For example, with \(p = 2\), a \(32 \times 32\) latent is divided into \(16 \times 16 = 256\) patches.

Impact of Patch Size

Smaller patches mean more tokens, which increases computation but preserves more information. The paper found \(p = 2\) to perform best, as it retains the most spatial information. The model naming convention is DiT-{Size}/{Patch} -- for instance, DiT-XL/2 denotes an XL-sized model with patch size 2.

After patchification, standard positional embeddings are added so the model can perceive spatial position information.
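
To make the bookkeeping concrete, here is a minimal patchify sketch in PyTorch. The dimensions follow the description above; the module name is illustrative, and hidden_dim=1152 corresponds to DiT-XL:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a (C, H, W) latent into T = H*W / p^2 tokens of dimension d,
    implemented as a ViT-style strided convolution."""
    def __init__(self, in_channels=4, patch_size=2, hidden_dim=1152):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, hidden_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, z):                    # z: (B, 4, 32, 32) for ImageNet 256
        x = self.proj(z)                     # (B, d, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, T, d)

tokens = PatchEmbed()(torch.randn(1, 4, 32, 32))
print(tokens.shape)  # torch.Size([1, 256, 1152]) -- 16 x 16 = 256 patches for p = 2
```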

2.3 Conditioning Embeddings

DiT needs to inject two types of conditioning information:

Timestep Embedding:

  • Uses the same sinusoidal positional encoding as DDPM to map the scalar \(t\) to a vector
  • Then projects it to the hidden dimension via an MLP: \(t_{\text{emb}} = \text{MLP}(\text{sinusoidal}(t))\)

Class Label Embedding:

  • Uses a learnable embedding table to map the class label \(y\) to a vector
  • \(y_{\text{emb}} = \text{Embedding}(y)\)

The final conditioning vector: \(c = t_{\text{emb}} + y_{\text{emb}}\)
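
A short sketch of the two embedders follows. The names are illustrative, and DiT's actual class embedder additionally supports label dropout for classifier-free guidance, which is omitted here:

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim=256):
    """DDPM-style sinusoidal embedding of the scalar timestep t -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class Conditioner(nn.Module):
    """Builds the conditioning vector c = t_emb + y_emb."""
    def __init__(self, num_classes=1000, hidden_dim=1152, freq_dim=256):
        super().__init__()
        self.t_mlp = nn.Sequential(nn.Linear(freq_dim, hidden_dim), nn.SiLU(),
                                   nn.Linear(hidden_dim, hidden_dim))
        self.y_table = nn.Embedding(num_classes, hidden_dim)

    def forward(self, t, y):
        return self.t_mlp(timestep_embedding(t)) + self.y_table(y)  # c
```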

2.4 DiT Block: Transformer Block + Conditioning Injection

A DiT Block is a standard Transformer Block augmented with a conditioning injection mechanism. The paper systematically compares four conditioning injection methods, which constitutes one of the paper's core experiments.

Method 1: In-Context Conditioning

Treat \(t_{\text{emb}}\) and \(y_{\text{emb}}\) as two additional tokens, concatenated with the patch tokens and fed together into a standard Transformer block:

\[\text{Input} = [\underbrace{t_{\text{emb}}, y_{\text{emb}}}_{\text{condition tokens}}, \underbrace{z_1, z_2, ..., z_T}_{\text{patch tokens}}]\]
  • Pros: Simple to implement, requires no modifications to the Transformer architecture
  • Cons: The interaction between conditioning information and patch tokens relies entirely on self-attention to learn, which is less efficient

Method 2: Cross-Attention

Add a cross-attention layer to each Transformer block, where patch tokens serve as Queries and conditioning information serves as Keys/Values:

\[\text{CrossAttn}(Q = z_{\text{patches}}, K = V = c)\]
  • Pros: This is the same mechanism Stable Diffusion's U-Net uses for injecting text conditions, and has been proven effective
  • Cons: Introduces additional parameters and computation

Method 3: Adaptive LayerNorm (adaLN)

Use the conditioning vector \(c\) to regress the scale (\(\gamma\)) and shift (\(\beta\)) parameters of LayerNorm:

\[\gamma, \beta = \text{MLP}(c)\]
\[\text{adaLN}(x, c) = \gamma(c) \cdot \frac{x - \mu}{\sigma} + \beta(c)\]
  • Pros: Parameter-efficient, no additional attention layers needed
  • Cons: Conditioning information is injected only through the normalization layer, potentially limiting expressiveness

Method 4: adaLN-Zero (The Optimal Approach)

Building on adaLN, an additional per-channel scale \(\alpha\) is applied before each residual connection, and the MLP that regresses it is zero-initialized so that \(\alpha = 0\) at the start of training, making each DiT block an identity function at initialization. This is the approach ultimately adopted by the DiT paper.

Key Conclusion

The experimental results show that the four conditioning injection methods rank as follows in terms of performance: adaLN-Zero > adaLN > Cross-Attention > In-Context. adaLN-Zero achieves the best FID across all model sizes.

2.5 Unpatchify: Recovering the Latent from Tokens

The final step of DiT converts the Transformer's output token sequence back into a latent of the same dimensions as the input:

  1. Final LayerNorm (modulated by adaLN-Zero)
  2. Linear projection: maps each token from \(d\) dimensions to \(p \times p \times 2C\) dimensions (predicting noise and diagonal covariance)
  3. Reshape: rearranges the token sequence back into the spatial structure \(H \times W \times 2C\)
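
A minimal sketch of this final layer (the adaLN-Zero modulation of the final LayerNorm is omitted for brevity; the class name and arguments are illustrative):

```python
import torch
import torch.nn as nn

class FinalLayer(nn.Module):
    """LayerNorm -> Linear to p*p*2C per token -> reshape back to (2C, H, W)."""
    def __init__(self, hidden_dim=1152, patch_size=2, out_channels=8):  # 2C = 8
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.linear = nn.Linear(hidden_dim, patch_size * patch_size * out_channels)
        self.p, self.c = patch_size, out_channels

    def forward(self, x, grid):                # x: (B, T, d), grid = H / p
        x = self.linear(self.norm(x))          # (B, T, p*p*2C)
        B = x.shape[0]
        x = x.view(B, grid, grid, self.p, self.p, self.c)
        x = x.permute(0, 5, 1, 3, 2, 4)        # (B, 2C, grid, p, grid, p)
        return x.reshape(B, self.c, grid * self.p, grid * self.p)
```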

3. adaLN-Zero Explained

adaLN-Zero is the key innovation of DiT and warrants a deeper understanding.

3.1 From Standard LayerNorm to adaLN-Zero

Standard LayerNorm:

\[y = \gamma \cdot \frac{x - \mu}{\sigma} + \beta\]

where \(\gamma\) and \(\beta\) are learnable parameters independent of the input.

Adaptive LayerNorm (adaLN):

\[\gamma, \beta = \text{MLP}(c)\]
\[y = \gamma(c) \cdot \frac{x - \mu}{\sigma} + \beta(c)\]

\(\gamma\) and \(\beta\) are dynamically generated from the conditioning vector \(c\), achieving condition-dependent normalization.

adaLN-Zero:

On top of adaLN, an additional scale parameter \(\alpha\) is introduced before each residual connection:

\[y = x + \alpha(c) \cdot \text{Block}(\text{adaLN}(x, c))\]

where \(\alpha\) is likewise regressed from the condition \(c\).

3.2 The Key Role of Zero Initialization

The "Zero" in adaLN-Zero refers to the initialization strategy:

  • The last layer of the MLP that regresses \(\alpha\) has its weights initialized to all zeros
  • This means at the start of training, \(\alpha = 0\)
  • Therefore, the output of each DiT block is: \(y = x + 0 \cdot \text{Block}(\cdot) = x\)

Equivalent behavior of adaLN-Zero at initialization:

Input x ──────────────────────────────── Output x
       |                            ^
       v                            | (alpha = 0, branch gated off)
  +----------+     +-----------+    |
  | adaLN    |---->| Attention |--*-+
  |          |     | or FFN    |  ^
  +----------+     +-----------+  |
       ^                     alpha = 0
       |
  Condition c ──> MLP ──> (gamma, beta, alpha)
                           gamma, beta initialized as usual
                           alpha initialized to 0

Why Does Zero Initialization Work?

This strategy draws inspiration from the zero-initialized residual idea in ResNets. The core benefits are:

  1. Training stability: At initialization, the model is equivalent to an identity mapping, allowing gradients to flow through the entire network without loss and avoiding training instability in deep networks
  2. Progressive learning: Each block starts by "doing nothing" and gradually learns meaningful transformations, resulting in a smoother training process
  3. Enables deeper stacking: Since the signal is not corrupted at initialization, stacking many blocks will not cause training collapse
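
In code, the "Zero" amounts to a couple of initialization lines on the modulation MLP. A sketch of the idea:

```python
import torch.nn as nn

hidden_dim = 1152
# MLP that regresses (gamma_1, beta_1, alpha_1, gamma_2, beta_2, alpha_2) from c
modulation = nn.Sequential(nn.SiLU(), nn.Linear(hidden_dim, 6 * hidden_dim))

# The "Zero": the final linear layer outputs all zeros at initialization,
# so alpha = 0 and every residual branch is gated off at the start of training.
nn.init.zeros_(modulation[-1].weight)
nn.init.zeros_(modulation[-1].bias)
```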

3.3 Complete Computation Flow of adaLN-Zero

The full computation inside a single DiT Block is as follows:

DiT Block (adaLN-Zero) Internal Structure
================================================================

Condition vector c ──> MLP ──> (gamma_1, beta_1, alpha_1,
                                gamma_2, beta_2, alpha_2)

Input x
  |
  v
  adaLN(x, gamma_1, beta_1)
  |
  v
  Multi-Head Self-Attention
  |
  v
  * alpha_1 (element-wise scale)
  |
  v
  + x (residual connection) ──> x'
  |
  v
  adaLN(x', gamma_2, beta_2)
  |
  v
  Pointwise FeedForward (MLP)
  |
  v
  * alpha_2 (element-wise scale)
  |
  v
  + x' (residual connection) ──> Output

================================================================

Each block regresses 6 vectors from the condition \(c\): \((\gamma_1, \beta_1, \alpha_1)\) for the attention branch and \((\gamma_2, \beta_2, \alpha_2)\) for the FFN branch.
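
Putting the pieces together, here is a self-contained sketch of one adaLN-Zero block built from standard PyTorch modules rather than the official implementation. In this sketch the regressed scale is applied as \(1 + \gamma\), so the zero-initialized MLP leaves the normalization unmodified at the start of training:

```python
import torch
import torch.nn as nn

def modulate(x, shift, scale):
    # Scale applied as (1 + scale) so a zero-initialized output leaves x unchanged
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class DiTBlock(nn.Module):
    """Transformer block with adaLN-Zero conditioning (sketch)."""
    def __init__(self, dim=1152, num_heads=16, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
                                 nn.Linear(int(dim * mlp_ratio), dim))
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.adaLN[-1].weight)   # the "Zero" in adaLN-Zero
        nn.init.zeros_(self.adaLN[-1].bias)

    def forward(self, x, c):                    # x: (B, T, d), c: (B, d)
        g1, b1, a1, g2, b2, a2 = self.adaLN(c).chunk(6, dim=-1)
        h = modulate(self.norm1(x), b1, g1)
        h, _ = self.attn(h, h, h)               # multi-head self-attention
        x = x + a1.unsqueeze(1) * h             # gated residual, alpha_1 = 0 at init
        h = self.mlp(modulate(self.norm2(x), b2, g2))
        return x + a2.unsqueeze(1) * h          # gated residual, alpha_2 = 0 at init
```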


4. Latent DiT Workflow

The integration of DiT with the Latent Diffusion Model (LDM) is very straightforward -- simply replace U-Net with DiT while keeping everything else unchanged.

4.1 Training Pipeline

Training Phase
================================================================

Real image x_0      Timestep t      Class label y
    |                   |                |
    v                   |                |
+----------+            |                |
| VAE      |            v                v
| Encoder  |       sinusoidal +     Embedding
+----------+        MLP = t_emb     Table = y_emb
    |                   |                |
    v                   +-------+--------+
  Latent z_0                    |
    |                     c = t_emb + y_emb
    v                           |
  Add noise (q(z_t|z_0))        |
  z_t = sqrt(abar_t)*z_0        |
      + sqrt(1-abar_t)*eps      |
    |                           |
    v                           v
  +-------------------------------+
  |          DiT(z_t, c)          |
  |  Predicts eps_theta (or x_0)  |
  +-------------------------------+
              |
              v
   Loss = ||eps - eps_theta||^2

================================================================
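
A minimal training-step sketch of this pipeline. The vae.encode and dit(z_t, t, y) interfaces are assumed for illustration; DiT additionally learns the covariance with a VLB term, which is omitted here:

```python
import torch
import torch.nn.functional as F

def training_step(dit, vae, x0, y, alphas_cumprod):
    """One latent-diffusion training step with a DiT denoiser (sketch)."""
    with torch.no_grad():
        z0 = vae.encode(x0)                                   # image -> latent (frozen VAE)
    t = torch.randint(0, len(alphas_cumprod), (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    abar = alphas_cumprod.to(z0.device)[t].view(-1, 1, 1, 1)
    z_t = abar.sqrt() * z0 + (1 - abar).sqrt() * eps          # forward noising q(z_t | z_0)
    eps_pred = dit(z_t, t, y)                                 # DiT(z_t, c)
    return F.mse_loss(eps_pred, eps)                          # simple noise-prediction loss
```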

4.2 Inference Pipeline

Inference Phase
================================================================

Random noise z_T ~ N(0, I)     Conditioning (t, y)
    |                           |
    v                           v
  +-------------------------------+
  |          DiT(z_t, c)          |    repeat for N steps
  |    Predicts noise eps_theta   |<---+
  +-------------------------------+    |
              |                        |
              v                        |
    Denoise one step: z_{t-1} ---------+
              |
              v (when t = 0)
         Clean latent z_0
              |
              v
         +----------+
         | VAE      |
         | Decoder  |
         +----------+
              |
              v
      Generated image x_0

================================================================
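
A matching ancestral-sampling sketch, again with assumed dit and vae.decode interfaces; DiT's learned covariance and classifier-free guidance are omitted, and the fixed beta_t variance is used instead:

```python
import torch

@torch.no_grad()
def sample(dit, vae, y, shape, betas):
    """DDPM ancestral sampling with a DiT denoiser (sketch)."""
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    z = torch.randn(shape)                                    # z_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = dit(z, t_batch, y)                              # predict the noise
        mean = (z - betas[t] / (1 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
        z = mean + betas[t].sqrt() * noise                    # one denoising step
    return vae.decode(z)                                      # latent -> image
```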

4.3 Comparison with Stable Diffusion

Architecture Comparison

DiT differs from Stable Diffusion only in the denoising network component:

Component           Stable Diffusion     Latent DiT
Image encoder       VAE Encoder          VAE Encoder (same)
Latent space        32x32x4 latent       32x32x4 latent (same)
Denoising network   U-Net                DiT (Transformer)
Noise schedule      DDPM / DDIM          DDPM / DDIM (same)
Image decoder       VAE Decoder          VAE Decoder (same)

In essence, DiT is a "drop-in replacement" for Stable Diffusion -- swap the denoising network, and everything else stays the same.


5. Scaling Experiment Results

A major contribution of the DiT paper is its systematic study of the scaling behavior of Transformer-based diffusion models.

5.1 Model Configurations

The paper designed four models of varying sizes, following the naming convention DiT-{Size}/{Patch}:

Model    Hidden dim \(d\)   Layers \(N\)   Attention Heads   Params (p=2)   Gflops (p=2)
DiT-S    384                12             6                 33M            6.06
DiT-B    768                12             12                130M           23.0
DiT-L    1024               24             16                458M           80.7
DiT-XL   1152               28             16                675M           119

5.2 ImageNet 256x256 Generation Results

On the class-conditional ImageNet 256x256 generation task (results without classifier-free guidance unless noted):

Model                                   FID-50K   IS
DiT-S/2                                 68.4      27.1
DiT-B/2                                 43.5      44.1
DiT-L/2                                 23.3      76.3
DiT-XL/2                                9.62      121.5
DiT-XL/2 (+ classifier-free guidance)   2.27      278.2

Key Findings

  1. Larger models achieve lower FID: From DiT-S to DiT-XL, FID decreases monotonically with substantial margins
  2. Smaller patches yield better results: \(p = 2\) outperforms \(p = 4\), which outperforms \(p = 8\), since smaller patches retain more spatial information
  3. Smooth scaling curve: FID decreases very smoothly as compute (Gflops) increases
  4. adaLN-Zero is optimal at all scales: The ranking of conditioning injection methods remains consistent across different model sizes

5.3 Scaling Curve

FID-50K vs. Compute (Gflops)  [schematic]
================================================================

  FID
   ^
70 |  * DiT-S/2
   |
60 |
   |
50 |
   |    * DiT-B/2
40 |
   |
30 |
   |        * DiT-L/2
20 |
   |
10 |              * DiT-XL/2
   |
 2 |                    * DiT-XL/2 (longer training)
   +-----+-----+------+------+------> Gflops
        6     23     81    119

================================================================

This curve is one of the most important results in the DiT paper: it shows that Diffusion Transformers exhibit a scaling law similar to that of LLMs -- performance improves predictably with increased compute.

5.4 Comparison with Other Models

When sufficiently trained and sampled with classifier-free guidance, DiT-XL/2 achieves an FID of 2.27, matching or surpassing the best diffusion models of its time (such as ADM and LDM):

Model                        FID-50K   Backbone
ADM (Dhariwal & Nichol)      10.94     U-Net
ADM + Classifier Guidance    4.59      U-Net
LDM-4 (Rombach et al.)       10.56     U-Net
DiT-XL/2 (with CFG)          2.27      Transformer

6. Subsequent Impact of DiT

The emergence of DiT marks the transition of diffusion models from the "U-Net era" to the "Transformer era." Its downstream impact has been profound.

6.1 Sora (OpenAI, 2024)

OpenAI's Sora video generation model is reportedly built on the DiT architecture, extended from 2D image patches to 3D spacetime patches. Sora's technical report explicitly cites DiT and describes the idea of treating video as a sequence of "spacetime patches."

6.2 SD3 / FLUX: MM-DiT

Stable Diffusion 3 and the FLUX series adopt the MM-DiT (Multi-Modal DiT) architecture:

  • Dual-stream design: image tokens and text tokens each have their own independent Transformer stream
  • Interaction occurs at the attention layers: tokens from both streams are concatenated for joint attention
  • Joint attention over concatenated tokens is more flexible than the cross-attention conditioning explored in the original DiT paper

6.3 PixArt-alpha

PixArt-alpha focuses on efficient training of DiT:

  • Uses a pretrained T5 text encoder to provide text conditioning
  • Employs a staged training strategy to reduce training costs
  • Demonstrates that the DiT architecture can produce high-quality models even with limited computational resources

6.4 SiT (Scalable Interpolant Transformers)

SiT retains DiT's Transformer architecture but replaces the underlying DDPM framework with a stochastic interpolant / flow matching framework, achieving further performance improvements.

6.5 Others

  • Hunyuan-DiT (Tencent): A text-to-image DiT with strong Chinese language understanding
  • Open-Sora: An open-source video generation solution based on DiT
  • Latte: An early work extending DiT to video generation

7. DiT vs. U-Net Comparison

Dimension                  U-Net                                DiT (Transformer)
Architectural origin       Medical image segmentation (2015)    Visual recognition, ViT (2020)
Core structure             Encoder-decoder + skip connections   Stacked Transformer blocks
Scaling behavior           Unclear, hard to predict             Smooth and predictable, LLM-like
Conditioning injection     Cross-attention + time embedding     adaLN-Zero (more efficient)
Multi-resolution features  Native (U-shaped structure)          Requires extra design
Compute efficiency         Moderate                             High (FlashAttention, etc.)
Engineering ecosystem      Mature but task-specific             Reuses LLM training infrastructure
Parameter efficiency       Lower                                Higher (better parameter utilization)
Video extension            Needs 3D-convolution surgery         Extends naturally to spacetime tokens
Multimodal extension       Difficult                            Natural (just concatenate tokens)
Inductive bias             Strong (locality, multi-scale)       Weak (data-driven)
Small-data performance     Better                               Needs more data

Trend Assessment

Since 2023, nearly all newly proposed state-of-the-art generative models have shifted to the Transformer architecture. U-Net's dominance in the diffusion domain has given way to DiT and its variants. This trend is consistent with ViT replacing CNNs as the dominant vision backbone.


8. Reflections and Discussion

8.1 Why Can Transformers Replace U-Net?

On the surface, U-Net's multi-scale features and skip connections seem essential for image generation. However, DiT's success reveals several deeper reasons:

  1. Attention can learn multi-scale features: Although Transformers lack explicit multi-scale structures, self-attention can adaptively attend to tokens at varying distances, implicitly achieving multi-scale modeling
  2. Sufficient model capacity compensates for the lack of inductive bias: Consistent with the ViT vs. CNN story, when the model is large enough and data is plentiful, strong inductive biases are no longer a prerequisite
  3. Latent space reduces resolution requirements: Operating in latent space (e.g., 32x32) rather than pixel space (e.g., 256x256) drastically reduces the number of tokens, making Transformer's \(O(n^2)\) complexity manageable

8.2 DiT's Significance for Video Generation

DiT's impact on video generation is particularly far-reaching:

  • Natural spatiotemporal extension: Extending 2D patches to 3D spacetime patches is straightforward, as Transformers natively handle variable-length sequences
  • Unified training framework: The same architecture can process both images and video -- only the patch extraction method needs to change
  • Feasibility of scaling: Video generation demands larger models, and DiT's scaling law guarantees that scaling up yields returns

The emergence of Sora is the best validation of this approach.

8.3 Is DiT Yet Another Case for the "Unified Architecture"?

From a broader perspective, DiT represents another instance of the Transformer conquering yet another domain as a "unified architecture":

Transformer's Expansion Across Domains
================================================================

  2017  NLP (the original Transformer)
    |
  2018  NLP pretraining (BERT, GPT)
    |
  2020  Visual recognition (ViT)
    |
  2021  Multimodal understanding (CLIP)
    |
  2023  Image generation (DiT)        <--- we are here
    |
  2024  Video generation (Sora)
    |
  2024+ Audio, 3D, robot control...

================================================================

The deeper logic behind this trend is that the Transformer's generality and scaling capabilities make it a "universal substrate for computation." Differences between tasks across domains are handled through tokenization strategies (text tokens, image patches, video patches, audio patches), while the core computation engine remains the same.

A Word of Caution

A unified architecture does not necessarily mean an optimal architecture. The \(O(n^2)\) attention complexity of Transformers remains a bottleneck for very long sequences (e.g., high-resolution video). Alternative approaches such as linear attention and state space models (Mamba) continue to evolve. The success of DiT should not be over-interpreted as "Transformers are always the best choice," but rather understood as "under the current computational paradigm and data scales, Transformers are the architecture with the greatest scaling potential."


9. Summary

The core contributions of DiT can be summarized in three points:

  1. Validated that Transformers can replace U-Net as the denoising backbone of diffusion models, achieving comparable or superior performance
  2. Proposed adaLN-Zero, an efficient conditioning injection method that achieves stable training through zero initialization
  3. Systematically studied scaling behavior, demonstrating that Diffusion Transformers exhibit a predictable scaling law similar to LLMs

The impact of DiT extends far beyond the paper itself. It initiated the shift in generative AI from "task-specific architectures" to a "unified architecture," laying the groundwork for subsequent works such as Sora and SD3, and providing yet another compelling piece of evidence for the thesis that "the Transformer is a universal computational architecture."

