Generative Model Comparison

Overview

Generative models represent one of the most active research directions in deep learning. From VAEs and GANs to diffusion models, Flow Matching, and autoregressive models, each paradigm has its own design philosophy and use cases. This chapter systematically compares the five major generative model families.

```mermaid
graph LR
    A[Generative Models] --> B[VAE]
    A --> C[GAN]
    A --> D[Diffusion]
    A --> E[Flow]
    A --> F[Autoregressive]

    B --> B1[2013]
    C --> C1[2014]
    D --> D1[2020 DDPM]
    E --> E1[2023 Flow Matching]
    F --> F1[2016 PixelRNN]
```

1. Five Major Generative Models

1.1 VAE (Variational Autoencoder)

Core Idea: Learn the latent distribution of data through an encoder-decoder structure.

\[ \mathcal{L}_{\text{VAE}} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z)) \]
  • Reconstruction term + KL divergence regularization
  • Continuous, interpolable latent space
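The negative ELBO can be computed in closed form when the encoder and prior are Gaussian. A minimal numpy sketch (toy arrays, not a trained model; the Gaussian reconstruction term reduces to an MSE up to a constant):

```python
import numpy as np

def kl_gaussian(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

def vae_loss(x, x_recon, mu, log_var):
    """Negative ELBO: Gaussian reconstruction term (MSE up to a constant)
    plus the KL regularizer from the formula above."""
    recon = np.sum((x - x_recon) ** 2)
    return recon + kl_gaussian(mu, log_var)

x = np.array([1.0, 0.0])
loss = vae_loss(x, x_recon=np.array([0.9, 0.1]),
                mu=np.zeros(2), log_var=np.zeros(2))
```

Note that when the posterior matches the prior exactly (mu = 0, log_var = 0), the KL term vanishes and only the reconstruction error remains.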

1.2 GAN (Generative Adversarial Network)

Core Idea: Adversarial game between generator and discriminator.

\[ \min_G \max_D \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \]
  • Implicit density model (no explicit distribution modeling)
  • High generation quality but unstable training
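The minimax objective above splits into a discriminator loss and a generator loss. A numpy sketch with plain probabilities standing in for network outputs (the non-saturating generator variant, -log D(G(z)), is what is usually trained in practice):

```python
import numpy as np

def gan_losses(d_real, d_fake):
    """Losses derived from the minimax objective.
    d_real = D(x) on real samples, d_fake = D(G(z)) on fakes, both in (0, 1)."""
    # D maximizes the value function, so its loss is the negated objective.
    d_loss = -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))
    # G minimizes log(1 - D(G(z))) in the original formulation.
    g_loss = np.mean(np.log(1.0 - d_fake))
    return d_loss, g_loss

# At the Nash equilibrium, D outputs 0.5 everywhere.
d_loss, g_loss = gan_losses(np.full(4, 0.5), np.full(4, 0.5))
```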

1.3 Diffusion Model

Core Idea: Learn the data distribution by gradually adding noise to data, then training a network to reverse the process through step-by-step denoising.

\[ \mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\| \epsilon - \epsilon_\theta(x_t, t) \|^2\right] \]
  • Stable training, extremely high generation quality
  • Slow sampling (requires multiple denoising steps)
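The training loop needs only the closed-form forward process and an MSE on the predicted noise. A numpy sketch (a random vector stands in for the learned network's prediction; alpha_bar_t denotes the cumulative noise schedule at timestep t):

```python
import numpy as np

rng = np.random.default_rng(0)

def q_sample(x0, alpha_bar_t, eps):
    """Closed-form forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def simple_loss(eps, eps_pred):
    """L_simple: MSE between the true noise and the network's prediction."""
    return np.mean((eps - eps_pred) ** 2)

x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_t = q_sample(x0, alpha_bar_t=0.5, eps=eps)   # half signal, half noise
```

Sampling then inverts this process, which is why it needs many denoising steps: each step removes only a small fraction of the noise.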

1.4 Flow-based Model

Core Idea: Map simple distributions to complex ones through invertible transformations.

\[ \log p(x) = \log p(z) + \log \left|\det \frac{\partial f^{-1}}{\partial x}\right| \]

Flow Matching (2024 mainstream):

\[ \mathcal{L}_{\text{FM}} = \mathbb{E}_{t, x_0, x_1}\left[\| v_\theta(x_t, t) - (x_1 - x_0) \|^2\right] \]
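With the linear (rectified-flow) probability path, both the interpolant x_t and the target velocity in the formula above are trivial to compute. A numpy sketch (x0 plays the role of the noise sample, x1 the data sample):

```python
import numpy as np

def fm_pair(x0, x1, t):
    """Linear path x_t = (1 - t) x0 + t x1 and its target velocity x1 - x0,
    which is constant along the straight path."""
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0

def fm_loss(v_pred, v_target):
    """L_FM: MSE between predicted and target velocity."""
    return np.mean((v_pred - v_target) ** 2)

x0 = np.array([0.0, 0.0])   # noise sample
x1 = np.array([1.0, 2.0])   # data sample
x_t, v = fm_pair(x0, x1, t=0.5)
```

Because the target velocity is constant along each straight path, sampling can follow nearly straight ODE trajectories, which is what enables fewer integration steps than DDPM-style diffusion.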

1.5 Autoregressive Model

Core Idea: Generate one token (or pixel) at a time, modeling the conditional probability chain.

\[ p(x) = \prod_{i=1}^{n} p(x_i | x_1, \ldots, x_{i-1}) \]
  • Exact log-likelihood
  • Slow generation (sequential)
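The chain-rule factorization is easy to make concrete with a hypothetical toy bigram model (the transition matrix and vocabulary below are invented for illustration; a real LLM conditions on the full prefix, not just the previous token):

```python
import numpy as np

# Toy bigram model over a 3-token vocabulary:
# P[i, j] = p(next token = j | current token = i); rows sum to 1.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
p_first = np.array([0.5, 0.3, 0.2])   # p(x_1)

def sequence_prob(tokens):
    """Chain-rule factorization: p(x) = p(x_1) * prod_i p(x_i | x_{i-1})."""
    p = p_first[tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        p *= P[prev, cur]
    return p
```

Because each factor is an explicit probability, the product gives an exact likelihood; and because each token conditions on the ones before it, generation is inherently sequential.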

2. Comprehensive Comparison

2.1 Core Dimension Comparison

| Dimension | VAE | GAN | Diffusion | Flow | Autoregressive |
|---|---|---|---|---|---|
| Training Stability | Stable | Unstable | Very stable | Stable | Stable |
| Generation Quality | Medium (blurry) | High (sharp) | Very high | High | High |
| Diversity | High | Medium (mode collapse) | Very high | High | High |
| Sampling Speed | Fast (one step) | Fast (one step) | Slow (multi-step) | Medium | Slow (sequential) |
| Likelihood | Lower bound (ELBO) | None | Lower bound | Exact | Exact |
| Controllability | Medium | Medium | High | High | High |
| Mode Coverage | Good | Poor | Good | Good | Good |
| Memory Requirement | Low | Low | High | Medium | Medium |
| Current Mainstream | As component | Declining | Yes | Rising | Yes (LLM) |

2.2 Training Comparison

| Feature | VAE | GAN | Diffusion | Flow | Autoregressive |
|---|---|---|---|---|---|
| Loss function | ELBO | Adversarial | Simple MSE | Velocity MSE | Cross-entropy |
| Training objective | Reconstruct + regularize | Minimax | Denoise | Velocity matching | Next-token prediction |
| Hyperparameter sensitivity | Low | High | Low | Low | Low |
| Convergence | Good | Poor | Good | Good | Good |

2.3 Architecture Comparison

| Feature | VAE | GAN | Diffusion | Flow | Autoregressive |
|---|---|---|---|---|---|
| Typical architecture | CNN/Transformer | CNN | U-Net/DiT | DiT | Transformer |
| Latent space | Continuous | Continuous | Pixel/latent | Continuous | Discrete tokens |
| Conditioning | Concat/cross-attn | Conditional BN | Cross-attn/CFG | Cross-attn | Prefix/prompt |

3. Decision Tree

```mermaid
graph TD
    A[Choose Generative Model] --> B{Task Type?}
    B -->|Text Generation| C[Autoregressive]
    B -->|Image Generation| D{Priority?}
    B -->|Video Generation| E[Diffusion/Flow]
    B -->|Representation Learning| F[VAE]

    D -->|Quality First| G{Speed Requirement?}
    D -->|Speed First| H[GAN/VAE]
    D -->|Controllability| I[Diffusion + CFG]

    G -->|Can Be Slow| J[Diffusion]
    G -->|Needs Speed| K[Flow Matching]

    C --> C1[GPT Series / LLaMA]
    J --> J1[SDXL / DALL-E 3]
    K --> K1[SD3 / Flux]
    I --> I1[ControlNet + Diffusion]
```

4. Historical Evolution

4.1 Timeline

| Year | Milestone | Significance |
|---|---|---|
| 2013 | VAE | Variational inference + deep learning |
| 2014 | GAN | Adversarial training paradigm |
| 2015 | DCGAN | GAN + CNN |
| 2017 | WGAN | Addressed GAN training instability |
| 2018 | BigGAN | Large-scale high-quality GAN |
| 2019 | StyleGAN | Style-controlled face generation |
| 2020 | DDPM | Practical diffusion models |
| 2021 | DALL-E / CLIP | Text-to-image |
| 2022 | Stable Diffusion | Latent diffusion, open-source ecosystem |
| 2022 | Imagen | Cascaded diffusion |
| 2023 | SDXL | Higher quality |
| 2023 | Consistency Models | Few-step/one-step generation |
| 2024 | SD3 / Flux | Flow Matching replaces Diffusion |
| 2024 | Sora | Video generation |

4.2 Paradigm Shifts

```mermaid
graph LR
    A[VAE 2013] --> B[GAN 2014-2021]
    B --> C[Diffusion 2020-2024]
    C --> D[Flow Matching 2024+]

    E[RNN 2016] --> F[Transformer AR 2020+]

    C -.-> G[Diffusion + AR Fusion]
    F -.-> G
```

Key Observations:

  1. GAN era (2014-2021): Pursued generation quality but difficult to train
  2. Diffusion era (2020-2024): Simple training, quality surpassing GANs
  3. Flow Matching (2024+): Cleaner theory, faster sampling
  4. Fusion trend: AR + Diffusion (e.g., Transfusion, MAR)

5. Hybrid Architectures

5.1 VAE + Diffusion (Latent Diffusion)

The core architecture of Stable Diffusion:

  1. VAE encoder: Image → latent space
  2. Diffusion operates in latent space
  3. VAE decoder: Latent space → image

Advantage: Performing diffusion in low-dimensional latent space dramatically reduces computation.
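The compute saving is easy to see in a toy sketch. Here 2x2 average pooling and nearest-neighbour upsampling are hypothetical stand-ins for the learned VAE encoder/decoder (real Stable Diffusion uses a learned VAE with an 8x spatial downsampling factor):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(img):
    """Stand-in "encoder": 2x2 average pooling, image -> latent."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def decode(z):
    """Stand-in "decoder": nearest-neighbour upsampling, latent -> image."""
    return np.repeat(np.repeat(z, 2, axis=0), 2, axis=1)

img = rng.random((8, 8))
z = encode(img)          # 1. image -> latent (4x fewer elements here)
z_denoised = z           # 2. diffusion would run here, in latent space
out = decode(z_denoised) # 3. latent -> image
```

Every denoising step operates on the smaller latent array, so the per-step cost of the diffusion network shrinks with the square of the downsampling factor.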

5.2 AR + Diffusion

  • Transfusion: AR for text, Diffusion for images
  • MAR (Masked Autoregressive): Masked-order autoregression over continuous tokens with a per-token diffusion loss
  • Fluid: Autoregressive with continuous tokens

5.3 GAN + Diffusion

  • Consistency Models: Distill diffusion models into one-step generators
  • GAN for acceleration: Discriminator guides diffusion to reduce steps

6. Application Recommendations

| Application | Recommended Model | Rationale |
|---|---|---|
| Text generation | Autoregressive (LLM) | Optimal for discrete tokens |
| High-quality images | Diffusion / Flow | Highest quality |
| Real-time image generation | GAN / Consistency Models | Single-step generation |
| Image editing | Diffusion + guidance | Best controllability |
| Video generation | Diffusion / Flow | Temporal consistency |
| 3D generation | Diffusion (SDS) | Combines with NeRF/3DGS |
| Music/audio | Diffusion / AR | Both applicable |
| Data augmentation | VAE / GAN | Fast, lightweight |
| Representation learning | VAE | Structured latent space |
| Anomaly detection | VAE / Flow | Likelihood estimation |

7. Summary

Key Takeaways:

  1. No one-size-fits-all generative model — choice depends on task, quality, speed, and controllability trade-offs
  2. Diffusion/Flow dominates current image generation — stable training, high quality
  3. Autoregressive dominates text generation — core LLM paradigm
  4. Hybrid architectures are the trend — combining strengths of different models
  5. GANs are not dead — still valuable in real-time applications and discriminator-assisted training

References

  • Kingma & Welling, "Auto-Encoding Variational Bayes," ICLR 2014
  • Goodfellow et al., "Generative Adversarial Nets," NeurIPS 2014
  • Ho et al., "Denoising Diffusion Probabilistic Models," NeurIPS 2020
  • Lipman et al., "Flow Matching for Generative Modeling," ICLR 2023
  • Esser et al., "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis," ICML 2024
