# Generative Model Comparison

## Overview

Generative models are among the most active research directions in deep learning. From VAEs and GANs to diffusion models, Flow Matching, and autoregressive models, each paradigm has its own design philosophy and use cases. This chapter systematically compares the five major generative model families.
```mermaid
graph LR
    A[Generative Models] --> B[VAE]
    A --> C[GAN]
    A --> D[Diffusion]
    A --> E[Flow]
    A --> F[Autoregressive]
    B --> B1[2013]
    C --> C1[2014]
    D --> D1[2020 DDPM]
    E --> E1[2023 Flow Matching]
    F --> F1[2016 PixelRNN]
```
## 1. Five Major Generative Models

### 1.1 VAE (Variational Autoencoder)

**Core Idea**: Learn the latent distribution of data through an encoder-decoder structure.
- Reconstruction term + KL divergence regularization
- Continuous, interpolable latent space
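The two terms above can be written down directly. A minimal NumPy sketch (the function name and the β weighting are illustrative, not a reference implementation):

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """ELBO-style loss: reconstruction error plus KL(q(z|x) || N(0, I)).

    For a diagonal Gaussian posterior, the KL term has the closed form
    -0.5 * sum(1 + log_var - mu^2 - exp(log_var)).
    """
    recon = np.mean((x - x_recon) ** 2)                           # reconstruction term
    kl = -0.5 * np.mean(1.0 + log_var - mu**2 - np.exp(log_var))  # KL regularizer
    return recon + beta * kl

# A perfect reconstruction with a standard-normal posterior incurs zero loss.
loss = vae_loss(np.zeros(4), np.zeros(4), mu=np.zeros(2), log_var=np.zeros(2))
```

Setting β > 1 trades reconstruction fidelity for a more disentangled latent space (the β-VAE variant).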
### 1.2 GAN (Generative Adversarial Network)

**Core Idea**: Adversarial game between generator and discriminator.
- Implicit density model (no explicit distribution modeling)
- High generation quality but unstable training
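The minimax objective can be sketched as two losses over discriminator outputs. A toy NumPy sketch; the non-saturating generator loss shown here is the common practical variant rather than the original minimax form:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """D maximizes log D(x) + log(1 - D(G(z))); we minimize the negative."""
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating G loss: maximize log D(G(z)) instead of minimizing log(1 - D)."""
    return -np.mean(np.log(d_fake))

# At the theoretical equilibrium D outputs 0.5 everywhere,
# giving a discriminator loss of 2 * log 2.
eq_loss = discriminator_loss(np.array([0.5]), np.array([0.5]))
```

The implicit density nature of GANs is visible here: neither loss ever evaluates a probability of the data, only the discriminator's verdict.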
### 1.3 Diffusion Model

**Core Idea**: Learn the data distribution by gradually adding noise to data and training a model to gradually denoise it.
- Stable training, extremely high generation quality
- Slow sampling (requires multiple denoising steps)
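The "simple MSE" objective amounts to predicting the injected noise. A toy sketch of one training step; the closed-form forward step is the DDPM formulation, while `predict_noise` stands in for the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpm_training_loss(x0, alpha_bar_t, predict_noise):
    """One DDPM training step: noise x0 in closed form, regress the noise.

    Forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return np.mean((predict_noise(x_t, alpha_bar_t) - eps) ** 2)

# An oracle that inverts the forward process drives the loss to zero.
x0 = np.ones(8)
oracle = lambda x_t, ab: (x_t - np.sqrt(ab) * x0) / np.sqrt(1.0 - ab)
loss = ddpm_training_loss(x0, 0.5, oracle)
```

Because any noise level can be sampled in closed form, training needs only one forward pass per step; the multi-step cost appears only at sampling time.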
### 1.4 Flow-based Model

**Core Idea**: Map a simple distribution to a complex one through invertible transformations.

**Flow Matching** (the current mainstream, adopted at scale in 2024): regress a velocity field that transports noise to data along straight-line paths; training reduces to a simple MSE, and sampling integrates an ODE.
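Flow Matching trains the velocity field by plain regression. A minimal sketch with linear (rectified-flow) interpolation paths; the oracle predictor is illustrative:

```python
import numpy as np

def flow_matching_loss(x0, x1, t, predict_velocity):
    """Conditional flow matching with linear (rectified-flow) paths.

    Interpolant: x_t = (1 - t) * x0 + t * x1; target velocity: x1 - x0.
    """
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    return np.mean((predict_velocity(x_t, t) - target) ** 2)

# For a linear path the true velocity is constant in t, so an oracle
# returning x1 - x0 achieves zero loss at any t.
x0, x1 = np.zeros(4), np.ones(4)
loss = flow_matching_loss(x0, x1, 0.3, lambda x_t, t: x1 - x0)
```

The straight-line paths are what make few-step ODE sampling viable: the closer the learned trajectories are to straight lines, the fewer solver steps are needed.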
### 1.5 Autoregressive Model

**Core Idea**: Generate token by token (or pixel by pixel), modeling a chain of conditional probabilities.
- Exact log-likelihood
- Slow generation (sequential)
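The exact log-likelihood falls out of the chain rule: the loss is just the average next-token cross-entropy. A minimal NumPy sketch (shapes and names are illustrative):

```python
import numpy as np

def next_token_nll(logits, targets):
    """Average cross-entropy of the next token under a softmax over the vocab.

    logits: (seq_len, vocab_size); targets: (seq_len,) integer token ids.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(targets)), targets])

# Uniform logits over a vocab of 4 give a loss of log(4) per token.
loss = next_token_nll(np.zeros((3, 4)), np.array([0, 1, 2]))
```

Exponentiating the negative of this quantity gives perplexity, which is why autoregressive models report an exact (not lower-bounded) likelihood.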
## 2. Comprehensive Comparison

### 2.1 Core Dimension Comparison
| Dimension | VAE | GAN | Diffusion | Flow | Autoregressive |
|---|---|---|---|---|---|
| Training Stability | Stable | Unstable | Very stable | Stable | Stable |
| Generation Quality | Medium (blurry) | High (sharp) | Very high | High | High |
| Diversity | High | Medium (mode collapse) | Very high | High | High |
| Sampling Speed | Fast (one step) | Fast (one step) | Slow (multi-step) | Medium | Slow (sequential) |
| Likelihood | Lower bound (ELBO) | None | Lower bound | Exact | Exact |
| Controllability | Medium | Medium | High | High | High |
| Mode Coverage | Good | Poor | Good | Good | Good |
| Memory Requirement | Low | Low | High | Medium | Medium |
| Current Mainstream | As component | Declining | Yes | Rising | Yes (LLM) |
### 2.2 Training Comparison
| Feature | VAE | GAN | Diffusion | Flow | Autoregressive |
|---|---|---|---|---|---|
| Loss function | ELBO | Adversarial | Simple MSE | Velocity MSE | Cross-entropy |
| Training objective | Reconstruct+regularize | Minimax | Denoise | Velocity matching | Next-token prediction |
| Hyperparameter sensitivity | Low | High | Low | Low | Low |
| Convergence | Good | Poor | Good | Good | Good |
### 2.3 Architecture Comparison
| Feature | VAE | GAN | Diffusion | Flow | Autoregressive |
|---|---|---|---|---|---|
| Typical architecture | CNN/Transformer | CNN | U-Net/DiT | DiT | Transformer |
| Latent space | Continuous | Continuous | Pixel/latent | Continuous | Discrete tokens |
| Conditioning | Concat/cross-attn | Conditional BN | Cross-attn/CFG | Cross-attn | Prefix/prompt |
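The CFG entry in the conditioning row refers to classifier-free guidance, which mixes a conditional and an unconditional prediction at sampling time. A one-line sketch; the guidance scale `w = 7.5` is a common illustrative default, not a fixed constant:

```python
import numpy as np

def cfg_predict(eps_uncond, eps_cond, w=7.5):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one. w = 1 recovers the conditional prediction;
    larger w trades diversity for prompt adherence."""
    return eps_uncond + w * (eps_cond - eps_uncond)

guided = cfg_predict(np.zeros(4), np.ones(4), w=7.5)
```

The same formula applies whether the model predicts noise (diffusion) or velocity (Flow Matching).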
## 3. Decision Tree

```mermaid
graph TD
    A[Choose Generative Model] --> B{Task Type?}
    B -->|Text Generation| C[Autoregressive]
    B -->|Image Generation| D{Priority?}
    B -->|Video Generation| E[Diffusion/Flow]
    B -->|Representation Learning| F[VAE]
    D -->|Quality First| G{Speed Requirement?}
    D -->|Speed First| H[GAN/VAE]
    D -->|Controllability| I[Diffusion + CFG]
    G -->|Can Be Slow| J[Diffusion]
    G -->|Needs Speed| K[Flow Matching]
    C --> C1[GPT Series / LLaMA]
    J --> J1[SDXL / DALL-E 3]
    K --> K1[SD3 / Flux]
    I --> I1[ControlNet + Diffusion]
```
## 4. Historical Evolution

### 4.1 Timeline
| Year | Milestone | Significance |
|---|---|---|
| 2013 | VAE | Variational inference + deep learning |
| 2014 | GAN | Adversarial training paradigm |
| 2015 | DCGAN | GAN + CNN |
| 2017 | WGAN | Addressed GAN training instability |
| 2018 | BigGAN | Large-scale high-quality GAN |
| 2019 | StyleGAN | Style-controlled face generation |
| 2020 | DDPM | Practical diffusion models |
| 2021 | DALL-E / CLIP | Text-to-image |
| 2022 | Stable Diffusion | Latent diffusion, open-source ecosystem |
| 2022 | Imagen | Cascaded diffusion |
| 2023 | SDXL | Higher quality |
| 2023 | Consistency Models | Few-step/one-step generation |
| 2024 | SD3 / Flux | Flow Matching supersedes DDPM-style diffusion in flagship models |
| 2024 | Sora | Video generation |
### 4.2 Paradigm Shifts

```mermaid
graph LR
    A[VAE 2013] --> B[GAN 2014-2021]
    B --> C[Diffusion 2020-2024]
    C --> D[Flow Matching 2024+]
    E[RNN 2016] --> F[Transformer AR 2020+]
    C -.-> G[Diffusion + AR Fusion]
    F -.-> G
```

**Key Observations:**
- GAN era (2014-2021): Pursued generation quality but difficult to train
- Diffusion era (2020-2024): Simple training, quality surpassing GANs
- Flow Matching (2024+): Cleaner theory, faster sampling
- Fusion trend: AR + Diffusion (e.g., Transfusion, MAR)
## 5. Hybrid Architectures

### 5.1 VAE + Diffusion (Latent Diffusion)

The core architecture of Stable Diffusion:
- VAE encoder: Image → latent space
- Diffusion operates in latent space
- VAE decoder: Latent space → image
**Advantage**: Performing diffusion in a low-dimensional latent space dramatically reduces computation.
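The pipeline can be sketched end to end. The 8× spatial downsampling factor matches Stable Diffusion's VAE; the encoder, decoder, and denoiser bodies here are toy stand-ins for the trained components:

```python
import numpy as np

# Toy stand-ins for the three trained components of latent diffusion.
def vae_encode(img):                      # image -> latent, 8x spatial downsample
    return img[::8, ::8]

def denoise_in_latent(z):                 # diffusion runs here, not on pixels
    return z                              # identity placeholder for the denoiser

def vae_decode(z):                        # latent -> image, 8x upsample
    return np.repeat(np.repeat(z, 8, axis=0), 8, axis=1)

img = np.zeros((512, 512))
z = vae_encode(img)                       # (64, 64) latent
out = vae_decode(denoise_in_latent(z))    # back to (512, 512)
```

Even in this toy form the payoff is visible: the denoiser touches 64× fewer values per step than a pixel-space model at the same resolution.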
### 5.2 AR + Diffusion
- Transfusion: AR for text, Diffusion for images
- MAR (Masked Autoregressive): Masked autoregressive generation
- Fluid: Autoregressive with continuous tokens
### 5.3 GAN + Diffusion

- Consistency Models: distill diffusion models into one- or few-step generators (a consistency loss rather than a GAN, but the same acceleration goal)
- Adversarial acceleration: a discriminator guides diffusion distillation or sampling to cut the step count
## 6. Application Recommendations
| Application | Recommended Model | Rationale |
|---|---|---|
| Text generation | Autoregressive (LLM) | Optimal for discrete tokens |
| High-quality images | Diffusion / Flow | Highest quality |
| Real-time image generation | GAN / Consistency Models | Single-step generation |
| Image editing | Diffusion + guidance | Best controllability |
| Video generation | Diffusion / Flow | Temporal consistency |
| 3D generation | Diffusion (SDS) | Combines with NeRF/3DGS |
| Music/audio | Diffusion / AR | Both applicable |
| Data augmentation | VAE / GAN | Fast, lightweight |
| Representation learning | VAE | Structured latent space |
| Anomaly detection | VAE / Flow | Likelihood estimation |
## 7. Summary

**Key Takeaways:**
- No one-size-fits-all generative model — choice depends on task, quality, speed, and controllability trade-offs
- Diffusion/Flow dominates current image generation — stable training, high quality
- Autoregressive dominates text generation — core LLM paradigm
- Hybrid architectures are the trend — combining strengths of different models
- GANs are not dead — still valuable in real-time applications and discriminator-assisted training
## References
- Kingma & Welling, "Auto-Encoding Variational Bayes," ICLR 2014
- Goodfellow et al., "Generative Adversarial Nets," NeurIPS 2014
- Ho et al., "Denoising Diffusion Probabilistic Models," NeurIPS 2020
- Lipman et al., "Flow Matching for Generative Modeling," ICLR 2023
- Esser et al., "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis," ICML 2024