GAN (Generative Adversarial Network)
GANs (Generative Adversarial Networks) were proposed by Ian Goodfellow et al. in the 2014 paper "Generative Adversarial Nets." The core idea is to have two neural networks — a Generator and a Discriminator — compete against each other to learn the data distribution and produce realistic samples. The advent of GANs ushered in a new era for deep generative models; Yann LeCun famously described the idea as "the most interesting idea in machine learning in the last ten years."
Background and Motivation
A Review of Generative Models
The central goal of deep generative models is to learn the data distribution \(p_{data}(x)\) and then sample from it to generate new data. Based on the modeling approach, generative models can be broadly divided into two categories:
Explicit Density Estimation: Directly define and optimize a probability density function \(p_\theta(x)\).
- Autoregressive models (e.g., PixelRNN/PixelCNN): Decompose the joint distribution into a product of conditional distributions, generating one pixel at a time
- Variational Autoencoders (VAE): Optimize a lower bound on the log-likelihood (ELBO) through variational inference
- Flow models: Compute the exact likelihood through invertible transformations
Implicit Density Estimation: Instead of explicitly defining a probability density function, directly learn a mapping from noise to data.
- GANs are the representative example: we never know the analytical form of \(p_g(x)\), but we can sample from it
Intuition: Explicit vs. Implicit
Explicit methods are like saying "let me write a mathematical formula to describe what a cat looks like" — you can both compute probabilities and generate samples. Implicit methods are like saying "let me train a painter who doesn't understand probability formulas, but whose cats look convincingly real" — you can only generate, but cannot compute the exact probability of a given image.
The Core Idea of GANs: A Game-Theoretic Perspective
GANs draw inspiration from two-player minimax games in game theory. The two networks each have their own objective:
- Generator G: Produce samples that are as realistic as possible, fooling the discriminator
- Discriminator D: Distinguish between real and generated samples as accurately as possible
The competition between the two drives mutual improvement, ultimately reaching a Nash equilibrium where the generated data is indistinguishable from real data.
The "Counterfeiter vs. Police" Analogy
Goodfellow offered an intuitive analogy in the original paper:
- Generator G = Counterfeiter: Continuously refines counterfeiting techniques, trying to make fake currency ever more convincing
- Discriminator D = Police: Continuously improves detection capabilities, trying to identify every counterfeit bill
The "arms race" between counterfeiter and police drives both sides to improve. The ideal end state is when the counterfeiter's technique is so perfect that the police cannot distinguish real from fake — at which point counterfeit currency is equivalent to genuine currency.
GAN Architecture in Detail
Overall Architecture
Random noise                Real data
z ~ N(0, I)                 x ~ p_data
     │                          │
     ▼                          │
┌──────────────┐                │
│ Generator G  │                │
│ (neural net) │                │
└──────┬───────┘                │
       │                        │
       ▼                        ▼
     G(z)                       x
(generated sample)        (real sample)
       │                        │
       └───────────┬────────────┘
                   │
                   ▼
        ┌─────────────────┐
        │ Discriminator D │
        │  (neural net)   │
        └────────┬────────┘
                 │
                 ▼
          D(·) ∈ [0, 1]
(estimated probability of being real)
Generator G
The generator is a mapping from a low-dimensional latent space to a high-dimensional data space:
- Input: A random noise vector \(z \sim p_z(z)\), typically \(z \sim \mathcal{N}(0, I)\) or \(z \sim \text{Uniform}(-1, 1)\)
- Output: A generated fake sample \(G(z)\), with the same dimensionality as real data
- Objective: Make the distribution \(p_g\) of \(G(z)\) as close as possible to the real data distribution \(p_{data}\)
In image generation tasks, G typically uses transposed convolutions to progressively upsample low-dimensional noise into high-resolution images.
Discriminator D
The discriminator is a binary classifier:
- Input: A sample (which may be a real sample \(x\) or a generated sample \(G(z)\))
- Output: An estimate of the probability that the sample comes from the real data
- Objective: Output high probability (close to 1) for real samples and low probability (close to 0) for generated samples
Adversarial Training Objective
GAN training is formalized as the following minimax game:
\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
\]
Breaking down this objective function:
- \(\mathbb{E}_{x \sim p_{data}}[\log D(x)]\): The log-probability that D classifies real samples as "real." D wants to maximize this term (pushing \(D(x) \to 1\))
- \(\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]\): The log-probability that D classifies generated samples as "fake." D wants to maximize this term (pushing \(D(G(z)) \to 0\)), while G wants to minimize it (pushing \(D(G(z)) \to 1\))
Intuition for the Objective Function
The discriminator D wants to make \(V(D,G)\) as large as possible — correctly identifying genuine samples and catching counterfeits. The generator G wants to make \(V(D,G)\) as small as possible — making the discriminator mistake counterfeits for genuine samples. This is the meaning of \(\min_G \max_D\).
Training Process in Detail
Alternating Training
GAN training employs an alternating optimization strategy. In each iteration:
Step 1: Train the Discriminator D (k steps)
Fix G and update D's parameters to maximize \(V(D, G)\):
- Sample a mini-batch \(\{x^{(1)}, \ldots, x^{(m)}\}\) from the real data
- Sample a mini-batch \(\{z^{(1)}, \ldots, z^{(m)}\}\) from the noise distribution
- Update D via gradient ascent:
\[
\theta_d \leftarrow \theta_d + \eta \, \nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log\left(1 - D(G(z^{(i)}))\right) \right]
\]
Step 2: Train the Generator G (1 step)
Fix D and update G's parameters to minimize \(V(D, G)\):
- Sample a mini-batch \(\{z^{(1)}, \ldots, z^{(m)}\}\) from the noise distribution
- Update G via gradient descent:
\[
\theta_g \leftarrow \theta_g - \eta \, \nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D(G(z^{(i)}))\right)
\]
Why Train D for Multiple Steps First?
If D is too weak, it cannot provide meaningful gradient signals to G. Intuitively, if the "police" cannot even tell real from fake, the "counterfeiter" has no way of knowing which direction to improve. The original paper therefore recommends training D for k steps before each step of G training (the paper found k=1 sufficient, but in practice more steps are sometimes needed).
Optimal Discriminator
Given a fixed generator G, the optimal discriminator \(D^*\) has a closed-form solution:
\[
D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}
\]
Derivation: For a fixed G, the optimization of \(V(D, G)\) with respect to D can be written as an integral over the data space:
\[
V(D, G) = \int_x \left[ p_{data}(x) \log D(x) + p_g(x) \log(1 - D(x)) \right] dx
\]
Taking the derivative of the integrand with respect to \(D(x)\) and setting it to zero:
\[
\frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0
\]
Solving yields \(D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\).
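As a sanity check, the pointwise maximization behind this derivation can be verified numerically. The sketch below fixes illustrative density values \(p_{data}(x) = 0.7\) and \(p_g(x) = 0.3\) at a single point \(x\) and scans candidate values of \(D(x)\):

```python
import numpy as np

# At any fixed x, D* maximizes  p_data(x)·log D + p_g(x)·log(1 - D).
# Illustrative densities at one point: p_data(x) = 0.7, p_g(x) = 0.3.
p_data, p_g = 0.7, 0.3
d = np.linspace(1e-6, 1 - 1e-6, 1_000_001)
integrand = p_data * np.log(d) + p_g * np.log(1.0 - d)
d_star = d[np.argmax(integrand)]
print(d_star)  # ≈ p_data / (p_data + p_g) = 0.7
```

The numerical argmax lands on \(0.7 = p_{data}/(p_{data} + p_g)\), matching the closed form.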
Global Optimum
Substituting \(D^*\) back into the objective function, one can show that:
\[
V(D^*, G) = -\log 4 + 2 \, D_{JS}(p_{data} \,\|\, p_g)
\]
where \(D_{JS}\) is the Jensen-Shannon (JS) Divergence:
\[
D_{JS}(p \,\|\, q) = \frac{1}{2} D_{KL}\!\left(p \,\Big\|\, \frac{p + q}{2}\right) + \frac{1}{2} D_{KL}\!\left(q \,\Big\|\, \frac{p + q}{2}\right)
\]
Since JS divergence is always non-negative and equals zero if and only if \(p_{data} = p_g\):
- The global optimum is achieved when \(p_g = p_{data}\)
- At this point, \(V(D^*, G) = -\log 4\)
- At this point, \(D^*(x) = \frac{1}{2}\), meaning the discriminator outputs 0.5 for every sample — completely unable to distinguish real from fake
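A quick numerical check of this optimum (the choice of distribution and grid below is illustrative): when \(p_g = p_{data}\), the optimal discriminator is \(1/2\) everywhere and the value function integrates to \(-\log 4\):

```python
import numpy as np

# When p_g = p_data (here both N(0, 1)), D*(x) = 1/2 everywhere and
# V(D*, G) = -log 4. Verified with a simple Riemann sum over a grid.
xs = np.linspace(-10.0, 10.0, 200_001)
dx = xs[1] - xs[0]
p = np.exp(-xs**2 / 2.0) / np.sqrt(2.0 * np.pi)  # p_data(x) = p_g(x)
d_star = p / (p + p)                             # = 0.5 for every x
v = float(np.sum(p * np.log(d_star) + p * np.log(1.0 - d_star)) * dx)
print(v, -np.log(4.0))  # both ≈ -1.3863
```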
Non-Saturating Loss
G's Gradient Problem in Practice
Early in training, G produces poor samples and \(D(G(z)) \approx 0\). In this regime, \(\log(1 - D(G(z))) \approx \log(1) = 0\), so the gradient is nearly zero and G can hardly learn. This is known as the saturation problem.
The original objective has G minimize \(\log(1 - D(G(z)))\), but in practice it is typically replaced with maximizing \(\log D(G(z))\):
\[
\max_G \; \mathbb{E}_{z \sim p_z}[\log D(G(z))]
\]
| Objective | G's Loss Function | Gradient Behavior |
|---|---|---|
| Original (Minimax) | \(\log(1 - D(G(z)))\) | Vanishing gradient when \(D(G(z)) \to 0\) |
| Alternative (Non-saturating) | \(-\log D(G(z))\) | Large gradient when \(D(G(z)) \to 0\) |
The two objectives have the same gradient magnitude at \(D(G(z)) = 0.5\), but the non-saturating version provides much stronger gradient signals early in training, greatly improving training stability.
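The difference is easiest to see through the discriminator's logit. Writing \(D(G(z)) = \sigma(t)\), the original loss has gradient \(-\sigma(t)\) with respect to \(t\), while the non-saturating loss has gradient \(\sigma(t) - 1\). A quick check (the logit value is illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Write D(G(z)) = sigmoid(t), where t is the discriminator's logit.
# Early in training D confidently rejects fakes, so t is very negative.
t = -6.0
s = sigmoid(t)                  # D(G(z)) ≈ 0.0025

grad_original = -s              # d/dt log(1 - sigmoid(t)): vanishes
grad_nonsat = s - 1.0           # d/dt [-log sigmoid(t)]: ≈ -1, strong signal
print(grad_original, grad_nonsat)

# At D(G(z)) = 0.5 (t = 0) the two gradients coincide, as stated above:
print(-sigmoid(0.0), sigmoid(0.0) - 1.0)  # -0.5 -0.5
```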
Numerical Example: Fitting a 1D Gaussian
To build intuition for the GAN training process, consider a simplified example: the real data follows a one-dimensional Gaussian \(p_{data} = \mathcal{N}(5, 1)\), and the generator learns this distribution starting from uniform noise \(z \sim \text{Uniform}(0, 1)\).
Suppose G is a simple linear transformation \(G(z) = az + b\), and D is a small neural network.
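This setup can be run end to end as a minimal NumPy sketch. The hyperparameters are illustrative, and central-difference numerical gradients stand in for backpropagation to keep the code self-contained; a real implementation would use an autodiff framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-np.clip(t, -60.0, 60.0)))

# Toy setup from the text: real data ~ N(5, 1), z ~ Uniform(0, 1),
# G(z) = a*z + b, and D(x) = sigmoid(w*x + c) as a minimal discriminator.
theta_g = np.array([1.0, 0.0])   # [a, b]
theta_d = np.array([0.0, 0.0])   # [w, c]

def G(z, tg):
    return tg[0] * z + tg[1]

def D(x, td):
    return sigmoid(td[0] * x + td[1])

def V(td, tg, x_real, z):
    # Mini-batch estimate of V(D, G); D performs gradient ascent on this.
    eps = 1e-8
    return (np.mean(np.log(D(x_real, td) + eps))
            + np.mean(np.log(1.0 - D(G(z, tg), td) + eps)))

def g_loss(tg, td, z):
    # Non-saturating generator loss: -log D(G(z))
    eps = 1e-8
    return -np.mean(np.log(D(G(z, tg), td) + eps))

def num_grad(f, p, h=1e-5):
    # Central-difference gradient (stands in for backpropagation here)
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2.0 * h)
    return g

lr, k = 0.05, 1
for step in range(3000):
    x_real = rng.normal(5.0, 1.0, size=64)
    for _ in range(k):                                       # Step 1: update D
        z = rng.uniform(0.0, 1.0, size=64)
        theta_d += lr * num_grad(lambda td: V(td, theta_g, x_real, z), theta_d)
    z = rng.uniform(0.0, 1.0, size=64)                       # Step 2: update G
    theta_g -= lr * num_grad(lambda tg: g_loss(tg, theta_d, z), theta_g)

mean_g = float(np.mean(G(rng.uniform(0.0, 1.0, 10000), theta_g)))
print(mean_g)  # drifts from the initial 0.5 toward the target mean of 5
```

A linear G of uniform noise cannot match the Gaussian's shape exactly, but its mean drifts toward 5, mirroring the qualitative trajectory in the table that follows.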
Visualizing the Training Process
| Stage | G's Output Distribution | D's Behavior | Description |
|---|---|---|---|
| Initial | \(G(z) \sim \text{Uniform}(0, 1)\) (far from target) | \(D\) easily distinguishes real from fake, \(D(x_{real}) \approx 1\), \(D(G(z)) \approx 0\) | G's samples look nothing like real data |
| Step 100 | Mean of \(G(z)\) distribution starts shifting toward 5 | \(D\) can still distinguish reasonably well, but accuracy drops | G has learned the approximate location |
| Step 500 | \(G(z) \approx \mathcal{N}(4.5, 0.8)\) | Discrimination becomes difficult, \(D(\cdot) \approx 0.6\text{--}0.7\) | G is approaching the target distribution |
| Converged | \(G(z) \approx \mathcal{N}(5, 1)\) | \(D(x) \approx 0.5\) (for all inputs) | Nash equilibrium reached |
Loss Trends
Loss
│
│ D_loss
│ ╲
│ ╲ ___________
│ ╲______╱ ─── → ln(2) ≈ 0.693
│
│ ╱‾‾‾‾‾‾‾‾‾‾‾‾‾
│ _____╱
│ ╱
│ ╱ G_loss
│
└──────────────────────────→ Training Steps
In the ideal case, when training reaches equilibrium:
- \(D_{loss} = -[\log(0.5) + \log(0.5)] = \log 4 \approx 1.386\) (averaged per sample: \(\log 2 \approx 0.693\))
- \(G_{loss} = -\log(0.5) = \log 2 \approx 0.693\)
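These equilibrium values follow directly from plugging \(D(\cdot) = 0.5\) into the losses:

```python
import numpy as np

# At equilibrium the discriminator outputs 0.5 for every sample.
d_real = d_fake = 0.5
d_loss = -(np.log(d_real) + np.log(1.0 - d_fake))  # = log 4
g_loss = -np.log(d_fake)                           # = log 2
print(d_loss, g_loss)  # ≈ 1.386, ≈ 0.693
```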
Training Difficulties and Solutions
GAN training is notoriously unstable. Below are several core challenges and corresponding strategies.
Mode Collapse
Problem: The generator learns to produce only a few types of samples while ignoring other modes of the data distribution. For example, on the MNIST dataset, G might only generate digits "1" and "7" while neglecting all other digits.
Cause: G discovers that certain samples are particularly effective at fooling D, and "takes the shortcut" of only generating those samples. From an optimization perspective, this occurs because G's objective function does not directly penalize lack of diversity.
Solutions:
- Minibatch Discrimination: Let D examine a batch of generated samples simultaneously and assess their diversity
- Unrolled GAN: Let G "look ahead" several steps of D's future updates during optimization
- WGAN: Replace JS divergence with the Wasserstein distance, fundamentally alleviating the problem
Training Instability
Problem: When D is too strong, G receives near-zero gradients (vanishing gradients); when D is too weak, the gradient signals G receives are meaningless. The two networks must maintain a delicate balance.
Analogy: It is like a student-teacher chess match — if the teacher is too strong, the student learns nothing (crushed at every move); if the teacher is too weak, the student also learns nothing (wins effortlessly every time).
Solutions:
- Carefully tune the learning rates and training step ratios for D and G
- Use Spectral Normalization to constrain D's Lipschitz constant
- Use the Two-Timescale Update Rule (TTUR): a larger learning rate for D and a smaller one for G
Vanishing Gradients
Problem: When D is trained too well, \(D(G(z)) \approx 0\), causing the original loss \(\log(1 - D(G(z))) \approx 0\), and G can barely receive gradient updates.
Mathematical Analysis: Under the optimal discriminator \(D^*\), when the supports of \(p_g\) and \(p_{data}\) do not overlap (which almost certainly happens in high-dimensional spaces), the JS divergence becomes a constant \(\log 2\), and the gradient is zero.
Solutions:
- Use the non-saturating loss
- Use the Wasserstein distance (which provides meaningful gradients even when distributions do not overlap)
- Add noise to the inputs (Instance Noise) to make the supports of the real and fake distributions overlap
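The constant-JS-divergence claim is easy to verify on discrete distributions with disjoint supports (the bin values below are illustrative):

```python
import numpy as np

def js_divergence(p, q):
    # Jensen-Shannon divergence between two discrete distributions
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two distributions on a shared 10-bin grid with disjoint supports:
p = np.array([0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5])
print(js_divergence(p, q), np.log(2.0))  # both ≈ 0.6931

# Moving q's support closer (but still disjoint) changes nothing: the
# divergence stays pinned at log 2, so it carries no gradient information.
q2 = np.roll(q, -2)  # mass now in bins 6 and 7, still disjoint from p
print(js_divergence(p, q2))  # still ≈ 0.6931
```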
Evaluation Difficulty
Problem: Unlike classification tasks with accuracy or regression tasks with MSE, GANs lack a single reliable metric for measuring generation quality. The training loss itself does not reflect generation quality — oscillating D and G losses do not necessarily indicate training failure.
Common Evaluation Metrics:
| Metric | What It Measures | Pros | Cons |
|---|---|---|---|
| IS (Inception Score) | Generation quality + diversity | Simple to compute | Does not compare against real data |
| FID (Fréchet Inception Distance) | Distance between generated and real distributions | Highly correlated with human judgment | Requires a large number of samples |
| Precision & Recall | Quality vs. coverage trade-off | Separates two dimensions | Computationally expensive |
Intuition Behind FID
FID extracts features from both real and generated images at an intermediate layer of the Inception network, models each set of features as a Gaussian distribution, and then computes the Fréchet distance between the two Gaussians. A lower FID indicates that the generated images are closer to the real images.
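For Gaussians with diagonal covariance, the Fréchet distance has a simple closed form, which the sketch below implements (the full FID uses complete covariance matrices and a matrix square root; the input values are illustrative):

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    # Fréchet distance between Gaussians with diagonal covariance:
    # ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1 * var2))
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2)))

# Identical feature distributions give distance 0 (a "perfect" FID)
print(frechet_distance_diag([0, 0], [1, 1], [0, 0], [1, 1]))  # 0.0
# Mean and variance mismatches both contribute to the distance
print(frechet_distance_diag([1, 0], [1, 1], [0, 0], [4, 1]))  # 2.0
```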
GAN Variants
After the basic GAN framework was introduced, a large number of improved variants emerged, advancing generative models along multiple fronts including architecture design, loss functions, and training strategies.
DCGAN (Deep Convolutional GAN, 2015)
DCGAN was the first work to successfully incorporate CNNs into GANs, proposing a set of architectural guidelines:
- Replace pooling layers with strided convolutions
- Use Batch Normalization in both G and D (except in D's input layer and G's output layer)
- Remove fully connected layers in favor of a fully convolutional architecture
- Use ReLU activations in G (with Tanh in the output layer) and LeakyReLU in D
Generator architecture (DCGAN):
z ∈ R^100 → FC → Reshape(4×4×1024) → ConvT(512) → ConvT(256) → ConvT(128) → ConvT(3) → 64×64×3 image
(BN + ReLU at every layer; Tanh at the output layer)
Discriminator architecture (DCGAN):
64×64×3 image → Conv(128) → Conv(256) → Conv(512) → Conv(1024) → FC → Sigmoid
(BN + LeakyReLU at every layer; no BN at the input layer)
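The spatial dimensions in the generator sketch above follow from the transposed-convolution output-size formula. Assuming the common DCGAN configuration of kernel 4, stride 2, padding 1 (PyTorch convention):

```python
def convt_out(size, kernel=4, stride=2, pad=1):
    # Transposed-convolution output size (PyTorch convention):
    # out = (in - 1) * stride - 2 * pad + kernel
    return (size - 1) * stride - 2 * pad + kernel

s = 4                 # spatial size after reshaping the FC output to 4×4×1024
for _ in range(4):    # four ConvT(k=4, s=2, p=1) upsampling stages
    s = convt_out(s)
print(s)  # 64, matching the 64×64×3 output image
```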
The significance of DCGAN lies in demonstrating that GANs can generate reasonably high-quality images and learn meaningful latent space representations (e.g., man with glasses - man + woman = woman with glasses).
WGAN (Wasserstein GAN, 2017)
WGAN was a major breakthrough in GAN training stability. Its core modification replaces JS divergence with the Wasserstein-1 distance (Earth Mover's Distance):
\[
W(p_{data}, p_g) = \inf_{\gamma \in \Pi(p_{data}, p_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\|x - y\|\right]
\]
where \(\Pi(p_{data}, p_g)\) is the set of all joint distributions whose marginals are \(p_{data}\) and \(p_g\).
Intuitively, the Wasserstein distance measures "the minimum amount of work required to move a pile of dirt from distribution \(p_g\) to distribution \(p_{data}\)."
Through the Kantorovich-Rubinstein duality, the WGAN objective becomes:
\[
\min_G \max_{\|D\|_L \le 1} \; \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]
\]
where \(\|D\|_L \le 1\) requires D to be 1-Lipschitz.
Key modifications:
- D (now called the Critic) no longer applies a Sigmoid to its output — instead of outputting a probability, it outputs a real-valued score
- Weight Clipping (clamping D's parameters to \([-c, c]\)) is used to approximately enforce the Lipschitz constraint
- Batch Normalization is removed from D (as it would violate the Lipschitz constraint)
- The RMSProp optimizer is used (momentum-based optimizers like Adam are avoided)
What Does WGAN Solve?
JS divergence equals a constant \(\log 2\) when the two distributions do not overlap, yielding zero gradients. The Wasserstein distance, by contrast, is continuous and differentiable even when distributions do not overlap, consistently providing meaningful gradient signals to G. Additionally, the Critic's loss can serve as an indicator of training progress — it correlates positively with generation quality.
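This contrast can be demonstrated in one dimension, where the empirical Wasserstein-1 distance between equal-size samples reduces to sorting (the sample sizes and shifts below are illustrative):

```python
import numpy as np

def w1_empirical(a, b):
    # Empirical Wasserstein-1 distance in 1-D: sorting both equal-size
    # samples yields the optimal transport plan, so W1 is just the mean
    # absolute difference between the sorted samples.
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 100_000)
dists = [w1_empirical(real, rng.normal(shift, 1.0, 100_000))
         for shift in (0.5, 2.0, 8.0)]
print(dists)  # grows smoothly with the shift: ≈ [0.5, 2.0, 8.0]
# JS divergence would instead saturate at log 2 once the supports
# effectively stop overlapping (e.g. shift = 8), giving G no gradient.
```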
WGAN-GP (WGAN with Gradient Penalty, 2017)
The weight clipping in WGAN is a crude approach that can cause weights to concentrate at the clipping boundaries \(\{-c, c\}\), limiting the model's expressive capacity.
WGAN-GP proposes replacing weight clipping with a gradient penalty:
\[
L = \mathbb{E}_{z \sim p_z}[D(G(z))] - \mathbb{E}_{x \sim p_{data}}[D(x)] + \lambda \, \mathbb{E}_{\hat{x}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right]
\]
Here \(\hat{x}\) is a random interpolation between real and generated samples, and \(\lambda\) is typically set to 10.
The gradient penalty enforces the 1-Lipschitz constraint across the entire space (keeping gradient norms close to 1), resulting in more stable training and better performance.
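The penalty term can be sketched in NumPy by using a linear critic, whose input gradient is available in closed form (a real implementation obtains \(\nabla_{\hat{x}} D(\hat{x})\) via autodiff; the weights below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear critic f(x) = w @ x, so the input gradient is w everywhere.
w = np.array([3.0, -4.0])                  # illustrative weights, ||w|| = 5
x_real = rng.normal(5.0, 1.0, size=(8, 2))
x_fake = rng.normal(0.0, 1.0, size=(8, 2))

eps = rng.uniform(0.0, 1.0, size=(8, 1))
x_hat = eps * x_real + (1.0 - eps) * x_fake   # random interpolates

grad = np.broadcast_to(w, x_hat.shape)        # ∇_x̂ f(x̂) = w for every x̂
grad_norm = np.linalg.norm(grad, axis=1)      # = 5.0 for every sample
lam = 10.0                                    # the usual λ = 10
penalty = lam * np.mean((grad_norm - 1.0) ** 2)
print(penalty)  # 10 * (5 - 1)^2 = 160.0
```

The penalty pulls the critic's gradient norms toward 1 over the interpolates, approximating the 1-Lipschitz constraint without clipping.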
Conditional GAN (cGAN, 2014)
Conditional GANs introduce additional conditioning information \(y\) (such as class labels) into both G and D:
\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z \mid y) \mid y))]
\]
This enables control over the generated content. For example, given the condition \(y = \text{"digit 5"}\), G will specifically generate images of the digit 5.
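A minimal sketch of one common conditioning mechanism: concatenating a one-hot label to the noise vector before it enters G (dimensions are illustrative; many implementations instead use label embeddings or projection discriminators):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, z_dim = 10, 100
z = rng.normal(size=(4, z_dim))              # a batch of 4 noise vectors
labels = np.array([5, 5, 5, 5])              # condition: generate "digit 5"
y = np.eye(n_classes)[labels]                # one-hot labels
g_input = np.concatenate([z, y], axis=1)     # G receives both z and y
print(g_input.shape)  # (4, 110)
```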
Pix2Pix (2016)
Pix2Pix applies cGAN to paired image-to-image translation tasks (e.g., semantic segmentation maps \(\to\) photographs, line drawings \(\to\) color images).
- G uses a U-Net architecture (an encoder-decoder with skip connections)
- D uses PatchGAN: instead of producing a single real/fake judgment for the entire image, it independently classifies each local patch as real or fake
- Loss = cGAN loss + L1 reconstruction loss
CycleGAN (2017)
CycleGAN addresses unpaired image translation (e.g., horses \(\leftrightarrow\) zebras, summer \(\leftrightarrow\) winter).
The core idea is the cycle consistency loss:
\[
L_{cyc}(G, F) = \mathbb{E}_{x}\left[\|F(G(x)) - x\|_1\right] + \mathbb{E}_{y}\left[\|G(F(y)) - y\|_1\right]
\]
Here G translates images from domain A to domain B, and F translates from domain B back to domain A. Cycle consistency requires that translating an image to the other domain and back should recover the original image.
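The cycle-consistency idea can be sketched with toy one-dimensional "translators" that happen to be exact inverses, in which case the loss is zero (G and F below are illustrative stand-ins for the two generator networks):

```python
import numpy as np

# Toy 1-D "translators": G maps domain A to B, F maps B back to A.
# Here F is G's exact inverse, so the cycle loss is zero.
G = lambda x: 2.0 * x + 1.0
F = lambda y: (y - 1.0) / 2.0

x = np.linspace(-1.0, 1.0, 5)   # stand-ins for images from domain A
y = np.linspace(0.0, 3.0, 5)    # stand-ins for images from domain B

l_cyc = float(np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y)))
print(l_cyc)  # 0.0: perfect reconstruction in both directions
```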
StyleGAN (2018-2021)
The StyleGAN series (StyleGAN, StyleGAN2, StyleGAN3) achieved breakthrough results in high-quality face generation.
Core innovations:
- Mapping network: \(z \to w\), first mapping noise through an 8-layer MLP to an intermediate latent space \(\mathcal{W}\)
- Adaptive Instance Normalization (AdaIN): Using \(w\) to control the generation style at each layer
- Noise injection: Injecting random noise at each layer to control fine details (e.g., hair texture, skin pores)
- Progressive training (StyleGAN1) / skip connections (StyleGAN2)
StyleGAN2 generates face images at 1024x1024 resolution that are virtually indistinguishable from real photographs to the human eye.
Progressive GAN (2017)
Progressive GAN introduced progressive growing training:
Training stages: 4×4 → 8×8 → 16×16 → 32×32 → ... → 1024×1024
Training begins at low resolution and progressively adds layers and increases resolution. This approach makes training more stable because:
- The low-resolution stages quickly learn global structure
- The high-resolution stages only need to learn fine details
- It avoids the difficulty of optimizing in high-dimensional space from the start
GAN vs. VAE vs. Diffusion Comparison
| Dimension | GAN | VAE | Diffusion |
|---|---|---|---|
| Modeling Approach | Implicit density (adversarial training) | Explicit density (variational inference) | Explicit density (denoising score matching) |
| Generation Quality | High (sharp, realistic) | Medium (often blurry) | Very high (rich in detail) |
| Training Stability | Poor (requires careful tuning) | Good (stable convergence) | Good (simple MSE loss) |
| Mode Coverage | Poor (mode collapse) | Good (likelihood optimization) | Very good (likelihood optimization) |
| Generation Speed | Fast (single forward pass) | Fast (single forward pass) | Slow (requires many iterative denoising steps) |
| Controllability | Medium (requires conditioning mechanisms) | Good (explicit latent space) | Good (Classifier-free Guidance) |
| Likelihood Computation | Not computable | Lower bound computable (ELBO) | Lower bound computable |
| Typical Applications | Image super-resolution, style transfer | Representation learning, anomaly detection | Text-to-image generation |
| Representative Models | StyleGAN, BigGAN | VQ-VAE, DALL-E 1 | Stable Diffusion, DALL-E 2/3 |
One-Sentence Summary
GANs generate fast and sharp results but are difficult to train; VAEs train stably but produce blurry outputs; Diffusion models achieve the highest quality but are the slowest to generate.
Reflections and Discussion
Why Has Diffusion Replaced GANs?
Starting with DDPM in 2020, diffusion models have progressively displaced GANs as the dominant approach in image generation. The main reasons include:
- Training stability: The training objective of diffusion models is a simple MSE denoising loss, free from the balancing act between G and D inherent in adversarial training. Anyone can achieve good results with a standard training pipeline, whereas GAN training requires numerous tricks and experience.
- Mode coverage: Diffusion models optimize the log-likelihood (or its lower bound), which naturally encourages coverage of all modes in the data distribution. The mode collapse problem in GANs has never been perfectly resolved.
- Scalability: Diffusion models perform better with large-scale data and large models, and their scaling behavior is more predictable (analogous to the Scaling Laws observed in LLMs).
- Controllable generation: Techniques such as Classifier-free Guidance make conditional generation with diffusion models highly flexible, driving the success of products like DALL-E 2, Stable Diffusion, and Midjourney.
Where Are GANs Still Important?
Despite yielding the "main stage" of image generation to diffusion models, GANs retain irreplaceable advantages in several scenarios:
- Real-time applications: GANs require only a single forward pass to generate, which is far faster than diffusion models that need tens to hundreds of denoising steps. GANs remain the first choice in video games, real-time style transfer, and mobile applications.
- Image super-resolution: Models like ESRGAN remain the mainstream approach for super-resolution to this day.
- Image editing and manipulation: GAN Inversion techniques can map real images back into the latent space for editing (e.g., changing age, expression, hairstyle).
- Data augmentation: Using GANs to generate synthetic data for training set expansion is particularly valuable in data-scarce domains such as medical imaging.
- Accelerating diffusion models: When distilling diffusion models into single-step generators, adversarial training is often employed (e.g., Adversarial Diffusion Distillation in SDXL-Turbo).
The Broader Impact of Adversarial Training
The adversarial training paradigm introduced by GANs extends far beyond image generation, profoundly influencing multiple areas of machine learning:
- Adversarial examples and robustness: Adversarial training is a core method for improving model robustness
- Domain adaptation: Learning domain-invariant features through adversarial training (e.g., DANN)
- Text generation: SeqGAN and related work brought GAN concepts to discrete sequence generation (though with limited success)
- Reinforcement learning: GAIL (Generative Adversarial Imitation Learning) uses adversarial training for imitation learning
- Fairness: Adversarial training is used to remove sensitive attribute information from models
- Privacy protection: Adversarial training has been applied in differential privacy and federated learning
Historical Significance
GANs may no longer be the optimal approach for image generation, but their contribution to the field of deep learning is lasting: they proved the viability of implicit density modeling, pioneered the adversarial training paradigm, and inspired countless subsequent works. As Goodfellow himself has noted, the most important contribution of GANs is not any specific model, but an entirely new training methodology.
References
- Goodfellow, I., et al. (2014). Generative Adversarial Nets. NeurIPS.
- Radford, A., et al. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (DCGAN). ICLR.
- Arjovsky, M., et al. (2017). Wasserstein GAN. ICML.
- Gulrajani, I., et al. (2017). Improved Training of Wasserstein GANs (WGAN-GP). NeurIPS.
- Mirza, M. & Osindero, S. (2014). Conditional Generative Adversarial Nets. arXiv.
- Isola, P., et al. (2016). Image-to-Image Translation with Conditional Adversarial Networks (Pix2Pix). CVPR.
- Zhu, J.-Y., et al. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (CycleGAN). ICCV.
- Karras, T., et al. (2018). Progressive Growing of GANs. ICLR.
- Karras, T., et al. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks (StyleGAN). CVPR.
- Karras, T., et al. (2020). Analyzing and Improving the Image Quality of StyleGAN (StyleGAN2). CVPR.