Generative Model Comparison

Overview

Generative models represent one of the most active research directions in deep learning. From VAEs and GANs to diffusion models, Flow Matching, and autoregressive models, each paradigm has its own design philosophy and use cases. This chapter systematically compares the five major generative model families.

```mermaid
graph LR
    A[Generative Models] --> B[VAE]
    A --> C[GAN]
    A --> D[Diffusion]
    A --> E[Flow]
    A --> F[Autoregressive]

    B --> B1[2013]
    C --> C1[2014]
    D --> D1[2020 DDPM]
    E --> E1[2023 Flow Matching]
    F --> F1[2016 PixelRNN]
```

1. Five Major Generative Models

1.1 VAE (Variational Autoencoder)

Core Idea: Learn the latent distribution of data through an encoder-decoder structure.

\[ \mathcal{L}_{\text{VAE}} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z)) \]
  • Reconstruction term + KL divergence regularization
  • Continuous, interpolable latent space
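The negative ELBO can be computed in closed form when the encoder and prior are Gaussian. A minimal numpy sketch (toy arrays, not a trained model; the Gaussian reconstruction term reduces to an MSE up to a constant):

```python
import numpy as np

def kl_gaussian(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

def vae_loss(x, x_recon, mu, log_var):
    """Negative ELBO: Gaussian reconstruction term (MSE up to a constant)
    plus the KL regularizer from the formula above."""
    recon = np.sum((x - x_recon) ** 2)
    return recon + kl_gaussian(mu, log_var)

x = np.array([1.0, 0.0])
loss = vae_loss(x, x_recon=np.array([0.9, 0.1]),
                mu=np.zeros(2), log_var=np.zeros(2))
```

Note that when the posterior matches the prior exactly (mu = 0, log_var = 0), the KL term vanishes and only the reconstruction error remains.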

1.2 GAN (Generative Adversarial Network)

Core Idea: Adversarial game between generator and discriminator.

\[ \min_G \max_D \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \]
  • Implicit density model (no explicit distribution modeling)
  • High generation quality but unstable training
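The minimax objective above splits into a discriminator loss and a generator loss. A numpy sketch with plain probabilities standing in for network outputs (the non-saturating generator variant, -log D(G(z)), is what is usually trained in practice):

```python
import numpy as np

def gan_losses(d_real, d_fake):
    """Losses derived from the minimax objective.
    d_real = D(x) on real samples, d_fake = D(G(z)) on fakes, both in (0, 1)."""
    # D maximizes the value function, so its loss is the negated objective.
    d_loss = -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))
    # G minimizes log(1 - D(G(z))) in the original formulation.
    g_loss = np.mean(np.log(1.0 - d_fake))
    return d_loss, g_loss

# At the Nash equilibrium, D outputs 0.5 everywhere.
d_loss, g_loss = gan_losses(np.full(4, 0.5), np.full(4, 0.5))
```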

1.3 Diffusion Model

Core Idea: Learn the data distribution by gradually adding noise to data, then training a network to reverse the process through step-by-step denoising.

\[ \mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\| \epsilon - \epsilon_\theta(x_t, t) \|^2\right] \]
  • Stable training, extremely high generation quality
  • Slow sampling (requires multiple denoising steps)
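The training loop needs only the closed-form forward process and an MSE on the predicted noise. A numpy sketch (a random vector stands in for the learned network's prediction; alpha_bar_t denotes the cumulative noise schedule at timestep t):

```python
import numpy as np

rng = np.random.default_rng(0)

def q_sample(x0, alpha_bar_t, eps):
    """Closed-form forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def simple_loss(eps, eps_pred):
    """L_simple: MSE between the true noise and the network's prediction."""
    return np.mean((eps - eps_pred) ** 2)

x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_t = q_sample(x0, alpha_bar_t=0.5, eps=eps)   # half signal, half noise
```

Sampling then inverts this process, which is why it needs many denoising steps: each step removes only a small fraction of the noise.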

1.4 Flow-based Model

Core Idea: Map simple distributions to complex ones through invertible transformations.

\[ \log p(x) = \log p(z) + \log \left|\det \frac{\partial f^{-1}}{\partial x}\right| \]

Flow Matching (2024 mainstream):

\[ \mathcal{L}_{\text{FM}} = \mathbb{E}_{t, x_0, x_1}\left[\| v_\theta(x_t, t) - (x_1 - x_0) \|^2\right] \]
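With the linear (rectified-flow) probability path, both the interpolant x_t and the target velocity in the formula above are trivial to compute. A numpy sketch (x0 plays the role of the noise sample, x1 the data sample):

```python
import numpy as np

def fm_pair(x0, x1, t):
    """Linear path x_t = (1 - t) x0 + t x1 and its target velocity x1 - x0,
    which is constant along the straight path."""
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0

def fm_loss(v_pred, v_target):
    """L_FM: MSE between predicted and target velocity."""
    return np.mean((v_pred - v_target) ** 2)

x0 = np.array([0.0, 0.0])   # noise sample
x1 = np.array([1.0, 2.0])   # data sample
x_t, v = fm_pair(x0, x1, t=0.5)
```

Because the target velocity is constant along each straight path, sampling can follow nearly straight ODE trajectories, which is what enables fewer integration steps than DDPM-style diffusion.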

1.5 Autoregressive Model

Core Idea: Generate one token (or pixel) at a time, modeling the conditional probability chain.

\[ p(x) = \prod_{i=1}^{n} p(x_i | x_1, \ldots, x_{i-1}) \]
  • Exact log-likelihood
  • Slow generation (sequential)
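The chain-rule factorization is easy to make concrete with a hypothetical toy bigram model (the transition matrix and vocabulary below are invented for illustration; a real LLM conditions on the full prefix, not just the previous token):

```python
import numpy as np

# Toy bigram model over a 3-token vocabulary:
# P[i, j] = p(next token = j | current token = i); rows sum to 1.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
p_first = np.array([0.5, 0.3, 0.2])   # p(x_1)

def sequence_prob(tokens):
    """Chain-rule factorization: p(x) = p(x_1) * prod_i p(x_i | x_{i-1})."""
    p = p_first[tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        p *= P[prev, cur]
    return p
```

Because each factor is an explicit probability, the product gives an exact likelihood; and because each token conditions on the ones before it, generation is inherently sequential.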

2. Comprehensive Comparison

2.1 Core Dimension Comparison

| Dimension | VAE | GAN | Diffusion | Flow | Autoregressive |
|---|---|---|---|---|---|
| Training Stability | Stable | Unstable | Very stable | Stable | Stable |
| Generation Quality | Medium (blurry) | High (sharp) | Very high | High | High |
| Diversity | High | Medium (mode collapse) | Very high | High | High |
| Sampling Speed | Fast (one step) | Fast (one step) | Slow (multi-step) | Medium | Slow (sequential) |
| Likelihood | Lower bound (ELBO) | None | Lower bound | Exact | Exact |
| Controllability | Medium | Medium | High | High | High |
| Mode Coverage | Good | Poor | Good | Good | Good |
| Memory Requirement | Low | Low | High | Medium | Medium |
| Current Mainstream | As component | Declining | Yes | Rising | Yes (LLM) |

2.2 Training Comparison

| Feature | VAE | GAN | Diffusion | Flow | Autoregressive |
|---|---|---|---|---|---|
| Loss function | ELBO | Adversarial | Simple MSE | Velocity MSE | Cross-entropy |
| Training objective | Reconstruct + regularize | Minimax | Denoise | Velocity matching | Next-token prediction |
| Hyperparameter sensitivity | Low | High | Low | Low | Low |
| Convergence | Good | Poor | Good | Good | Good |

2.3 Architecture Comparison

| Feature | VAE | GAN | Diffusion | Flow | Autoregressive |
|---|---|---|---|---|---|
| Typical architecture | CNN/Transformer | CNN | U-Net/DiT | DiT | Transformer |
| Latent space | Continuous | Continuous | Pixel/latent | Continuous | Discrete tokens |
| Conditioning | Concat/cross-attn | Conditional BN | Cross-attn/CFG | Cross-attn | Prefix/prompt |

3. Decision Tree

```mermaid
graph TD
    A[Choose Generative Model] --> B{Task Type?}
    B -->|Text Generation| C[Autoregressive]
    B -->|Image Generation| D{Priority?}
    B -->|Video Generation| E[Diffusion/Flow]
    B -->|Representation Learning| F[VAE]

    D -->|Quality First| G{Speed Requirement?}
    D -->|Speed First| H[GAN/VAE]
    D -->|Controllability| I[Diffusion + CFG]

    G -->|Can Be Slow| J[Diffusion]
    G -->|Needs Speed| K[Flow Matching]

    C --> C1[GPT Series / LLaMA]
    J --> J1[SDXL / DALL-E 3]
    K --> K1[SD3 / Flux]
    I --> I1[ControlNet + Diffusion]
```

4. Historical Evolution

4.1 Timeline

| Year | Milestone | Significance |
|---|---|---|
| 2013 | VAE | Variational inference + deep learning |
| 2014 | GAN | Adversarial training paradigm |
| 2015 | DCGAN | GAN + CNN |
| 2017 | WGAN | Addressed GAN training instability |
| 2018 | BigGAN | Large-scale high-quality GAN |
| 2019 | StyleGAN | Style-controlled face generation |
| 2020 | DDPM | Practical diffusion models |
| 2021 | DALL-E / CLIP | Text-to-image |
| 2022 | Stable Diffusion | Latent diffusion, open-source ecosystem |
| 2022 | Imagen | Cascaded diffusion |
| 2023 | SDXL | Higher quality |
| 2023 | Consistency Models | Few-step/one-step generation |
| 2024 | SD3 / Flux | Flow Matching replaces Diffusion |
| 2024 | Sora | Video generation |

4.2 Paradigm Shifts

```mermaid
graph LR
    A[VAE 2013] --> B[GAN 2014-2021]
    B --> C[Diffusion 2020-2024]
    C --> D[Flow Matching 2024+]

    E[RNN 2016] --> F[Transformer AR 2020+]

    C -.-> G[Diffusion + AR Fusion]
    F -.-> G
```

Key Observations:

  1. GAN era (2014-2021): Pursued generation quality but difficult to train
  2. Diffusion era (2020-2024): Simple training, quality surpassing GANs
  3. Flow Matching (2024+): Cleaner theory, faster sampling
  4. Fusion trend: AR + Diffusion (e.g., Transfusion, MAR)

5. Hybrid Architectures

5.1 VAE + Diffusion (Latent Diffusion)

The core architecture of Stable Diffusion:

  1. VAE encoder: Image → latent space
  2. Diffusion operates in latent space
  3. VAE decoder: Latent space → image

Advantage: Performing diffusion in low-dimensional latent space dramatically reduces computation.
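The compute saving is easy to see in a toy sketch. Here 2x2 average pooling and nearest-neighbour upsampling are hypothetical stand-ins for the learned VAE encoder/decoder (real Stable Diffusion uses a learned VAE with an 8x spatial downsampling factor):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(img):
    """Stand-in "encoder": 2x2 average pooling, image -> latent."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def decode(z):
    """Stand-in "decoder": nearest-neighbour upsampling, latent -> image."""
    return np.repeat(np.repeat(z, 2, axis=0), 2, axis=1)

img = rng.random((8, 8))
z = encode(img)          # 1. image -> latent (4x fewer elements here)
z_denoised = z           # 2. diffusion would run here, in latent space
out = decode(z_denoised) # 3. latent -> image
```

Every denoising step operates on the smaller latent array, so the per-step cost of the diffusion network shrinks with the square of the downsampling factor.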

5.2 AR + Diffusion

  • Transfusion: AR for text, Diffusion for images
  • MAR (Masked Autoregressive): Masked-order autoregression over continuous tokens with a per-token diffusion loss
  • Fluid: Autoregressive with continuous tokens

5.3 GAN + Diffusion

  • Consistency Models: Distill diffusion models into one-step generators
  • GAN for acceleration: Discriminator guides diffusion to reduce steps

6. Application Recommendations

| Application | Recommended Model | Rationale |
|---|---|---|
| Text generation | Autoregressive (LLM) | Optimal for discrete tokens |
| High-quality images | Diffusion / Flow | Highest quality |
| Real-time image generation | GAN / Consistency Models | Single-step generation |
| Image editing | Diffusion + guidance | Best controllability |
| Video generation | Diffusion / Flow | Temporal consistency |
| 3D generation | Diffusion (SDS) | Combines with NeRF/3DGS |
| Music/audio | Diffusion / AR | Both applicable |
| Data augmentation | VAE / GAN | Fast, lightweight |
| Representation learning | VAE | Structured latent space |
| Anomaly detection | VAE / Flow | Likelihood estimation |

7. Summary

Key Takeaways:

  1. No one-size-fits-all generative model — choice depends on task, quality, speed, and controllability trade-offs
  2. Diffusion/Flow dominates current image generation — stable training, high quality
  3. Autoregressive dominates text generation — core LLM paradigm
  4. Hybrid architectures are the trend — combining strengths of different models
  5. GANs are not dead — still valuable in real-time applications and discriminator-assisted training

References

  • Kingma & Welling, "Auto-Encoding Variational Bayes," ICLR 2014
  • Goodfellow et al., "Generative Adversarial Nets," NeurIPS 2014
  • Ho et al., "Denoising Diffusion Probabilistic Models," NeurIPS 2020
  • Lipman et al., "Flow Matching for Generative Modeling," ICLR 2023
  • Esser et al., "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis," ICML 2024
