GAN (Generative Adversarial Network)
GANs (Generative Adversarial Networks) were proposed by Ian Goodfellow et al. in the 2014 paper "Generative Adversarial Nets." The core idea is to have two neural networks — a Generator and a Discriminator — compete against each other to learn the data distribution and produce realistic samples. The advent of GANs ushered in a new era for deep generative models; Yann LeCun famously described the idea as "the most interesting idea in machine learning in the last ten years."
Background and Motivation
A Review of Generative Models
The central goal of deep generative models is to learn the data distribution \(p_{data}(x)\) and then sample from it to generate new data. Based on the modeling approach, generative models can be broadly divided into two categories:
Explicit Density Estimation: Directly define and optimize a probability density function \(p_\theta(x)\).
- Autoregressive models (e.g., PixelRNN/PixelCNN): Decompose the joint distribution into a product of conditional distributions, generating one pixel at a time
- Variational Autoencoders (VAE): Optimize a lower bound on the log-likelihood (ELBO) through variational inference
- Flow models: Compute the exact likelihood through invertible transformations
Implicit Density Estimation: Instead of explicitly defining a probability density function, directly learn a mapping from noise to data.
- GANs are the representative example: we never know the analytical form of \(p_g(x)\), but we can sample from it
Intuition: Explicit vs. Implicit
Explicit methods are like saying "let me write a mathematical formula to describe what a cat looks like" — you can both compute probabilities and generate samples. Implicit methods are like saying "let me train a painter who doesn't understand probability formulas, but whose cats look convincingly real" — you can only generate, but cannot compute the exact probability of a given image.
The Core Idea of GANs: A Game-Theoretic Perspective
GANs draw inspiration from two-player minimax games in game theory. The two networks each have their own objective:
- Generator G: Produce samples that are as realistic as possible, fooling the discriminator
- Discriminator D: Distinguish between real and generated samples as accurately as possible
The competition between the two drives mutual improvement, ultimately reaching a Nash equilibrium where the generated data is indistinguishable from real data.
The "Counterfeiter vs. Police" Analogy
Goodfellow offered an intuitive analogy in the original paper:
- Generator G = Counterfeiter: Continuously refines counterfeiting techniques, trying to make fake currency ever more convincing
- Discriminator D = Police: Continuously improves detection capabilities, trying to identify every counterfeit bill
The "arms race" between counterfeiter and police drives both sides to improve. The ideal end state is when the counterfeiter's technique is so perfect that the police cannot distinguish real from fake — at which point counterfeit currency is equivalent to genuine currency.
GAN Architecture in Detail
Overall Architecture
Random noise                Real data
z ~ N(0, I)                 x ~ p_data
     │                          │
     ▼                          │
┌──────────────┐                │
│ Generator G  │                │
│ (neural net) │                │
└──────┬───────┘                │
       │                        │
       ▼                        ▼
     G(z)                       x
(generated sample)        (real sample)
       │                        │
       └───────────┬────────────┘
                   │
                   ▼
        ┌─────────────────┐
        │ Discriminator D │
        │  (neural net)   │
        └────────┬────────┘
                 │
                 ▼
          D(·) ∈ [0, 1]
(estimated probability of being real)
Generator G
The generator is a mapping from a low-dimensional latent space to a high-dimensional data space:
- Input: A random noise vector \(z \sim p_z(z)\), typically \(z \sim \mathcal{N}(0, I)\) or \(z \sim \text{Uniform}(-1, 1)\)
- Output: A generated fake sample \(G(z)\), with the same dimensionality as real data
- Objective: Make the distribution \(p_g\) of \(G(z)\) as close as possible to the real data distribution \(p_{data}\)
In image generation tasks, G typically uses transposed convolutions to progressively upsample low-dimensional noise into high-resolution images.
Discriminator D
The discriminator is a binary classifier:
- Input: A sample (which may be a real sample \(x\) or a generated sample \(G(z)\))
- Output: An estimate of the probability that the sample comes from the real data
- Objective: Output high probability (close to 1) for real samples and low probability (close to 0) for generated samples
Adversarial Training Objective
GAN training is formalized as the following minimax game:
\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
\]
Breaking down this objective function:
- \(\mathbb{E}_{x \sim p_{data}}[\log D(x)]\): The log-probability that D classifies real samples as "real." D wants to maximize this term (pushing \(D(x) \to 1\))
- \(\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]\): The log-probability that D classifies generated samples as "fake." D wants to maximize this term (pushing \(D(G(z)) \to 0\)), while G wants to minimize it (pushing \(D(G(z)) \to 1\))
Intuition for the Objective Function
The discriminator D wants to make \(V(D,G)\) as large as possible — correctly identifying genuine samples and catching counterfeits. The generator G wants to make \(V(D,G)\) as small as possible — making the discriminator mistake counterfeits for genuine samples. This is the meaning of \(\min_G \max_D\).
Training Process in Detail
Alternating Training
GAN training employs an alternating optimization strategy. In each iteration:
Step 1: Train the Discriminator D (k steps)
Fix G and update D's parameters to maximize \(V(D, G)\):
- Sample a mini-batch \(\{x^{(1)}, \ldots, x^{(m)}\}\) from the real data
- Sample a mini-batch \(\{z^{(1)}, \ldots, z^{(m)}\}\) from the noise distribution
- Update D via gradient ascent:
\[
\theta_d \leftarrow \theta_d + \eta \, \nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log\left(1 - D(G(z^{(i)}))\right) \right]
\]
Step 2: Train the Generator G (1 step)
Fix D and update G's parameters to minimize \(V(D, G)\):
- Sample a mini-batch \(\{z^{(1)}, \ldots, z^{(m)}\}\) from the noise distribution
- Update G via gradient descent:
\[
\theta_g \leftarrow \theta_g - \eta \, \nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D(G(z^{(i)}))\right)
\]
Why Train D for Multiple Steps First?
If D is too weak, it cannot provide meaningful gradient signals to G. Intuitively, if the "police" cannot even tell real from fake, the "counterfeiter" has no way of knowing which direction to improve. The original paper therefore recommends training D for k steps before each step of G training (the paper found k=1 sufficient, but in practice more steps are sometimes needed).
Optimal Discriminator
Given a fixed generator G, the optimal discriminator \(D^*\) has a closed-form solution:
\[
D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}
\]
Derivation: For a fixed G, the optimization of \(V(D, G)\) with respect to D can be written as an integral over the data space:
\[
V(D, G) = \int_x \left[ p_{data}(x) \log D(x) + p_g(x) \log(1 - D(x)) \right] dx
\]
Taking the derivative of the integrand with respect to \(D(x)\) and setting it to zero:
\[
\frac{p_{data}(x)}{D(x)} - \frac{p_g(x)}{1 - D(x)} = 0
\]
Solving yields \(D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\).
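As a sanity check, the pointwise maximization behind this derivation can be verified numerically. The sketch below fixes illustrative density values \(p_{data}(x) = 0.7\) and \(p_g(x) = 0.3\) at a single point \(x\) and scans candidate values of \(D(x)\):

```python
import numpy as np

# At any fixed x, D* maximizes  p_data(x)·log D + p_g(x)·log(1 - D).
# Illustrative densities at one point: p_data(x) = 0.7, p_g(x) = 0.3.
p_data, p_g = 0.7, 0.3
d = np.linspace(1e-6, 1 - 1e-6, 1_000_001)
integrand = p_data * np.log(d) + p_g * np.log(1.0 - d)
d_star = d[np.argmax(integrand)]
print(d_star)  # ≈ p_data / (p_data + p_g) = 0.7
```

The numerical argmax lands on \(0.7 = p_{data}/(p_{data} + p_g)\), matching the closed form.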
Global Optimum
Substituting \(D^*\) back into the objective function, one can show that:
\[
V(D^*, G) = -\log 4 + 2 \, D_{JS}(p_{data} \,\|\, p_g)
\]
where \(D_{JS}\) is the Jensen-Shannon (JS) Divergence:
\[
D_{JS}(p \,\|\, q) = \frac{1}{2} D_{KL}\!\left(p \,\Big\|\, \frac{p + q}{2}\right) + \frac{1}{2} D_{KL}\!\left(q \,\Big\|\, \frac{p + q}{2}\right)
\]
Since JS divergence is always non-negative and equals zero if and only if \(p_{data} = p_g\):
- The global optimum is achieved when \(p_g = p_{data}\)
- At this point, \(V(D^*, G) = -\log 4\)
- At this point, \(D^*(x) = \frac{1}{2}\), meaning the discriminator outputs 0.5 for every sample — completely unable to distinguish real from fake
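A quick numerical check of this optimum (the choice of distribution and grid below is illustrative): when \(p_g = p_{data}\), the optimal discriminator is \(1/2\) everywhere and the value function integrates to \(-\log 4\):

```python
import numpy as np

# When p_g = p_data (here both N(0, 1)), D*(x) = 1/2 everywhere and
# V(D*, G) = -log 4. Verified with a simple Riemann sum over a grid.
xs = np.linspace(-10.0, 10.0, 200_001)
dx = xs[1] - xs[0]
p = np.exp(-xs**2 / 2.0) / np.sqrt(2.0 * np.pi)  # p_data(x) = p_g(x)
d_star = p / (p + p)                             # = 0.5 for every x
v = float(np.sum(p * np.log(d_star) + p * np.log(1.0 - d_star)) * dx)
print(v, -np.log(4.0))  # both ≈ -1.3863
```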
Non-Saturating Loss
G's Gradient Problem in Practice
Early in training, G produces poor samples and \(D(G(z)) \approx 0\). In this regime, \(\log(1 - D(G(z))) \approx \log(1) = 0\), so the gradient is nearly zero and G can hardly learn. This is known as the saturation problem.
The original objective has G minimize \(\log(1 - D(G(z)))\), but in practice it is typically replaced with maximizing \(\log D(G(z))\):
\[
\max_G \; \mathbb{E}_{z \sim p_z}[\log D(G(z))]
\]
| Objective | G's Loss Function | Gradient Behavior |
|---|---|---|
| Original (Minimax) | \(\log(1 - D(G(z)))\) | Vanishing gradient when \(D(G(z)) \to 0\) |
| Alternative (Non-saturating) | \(-\log D(G(z))\) | Large gradient when \(D(G(z)) \to 0\) |
The two objectives have the same gradient magnitude at \(D(G(z)) = 0.5\), but the non-saturating version provides much stronger gradient signals early in training, greatly improving training stability.
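The difference is easiest to see through the discriminator's logit. Writing \(D(G(z)) = \sigma(t)\), the original loss has gradient \(-\sigma(t)\) with respect to \(t\), while the non-saturating loss has gradient \(\sigma(t) - 1\). A quick check (the logit value is illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Write D(G(z)) = sigmoid(t), where t is the discriminator's logit.
# Early in training D confidently rejects fakes, so t is very negative.
t = -6.0
s = sigmoid(t)                  # D(G(z)) ≈ 0.0025

grad_original = -s              # d/dt log(1 - sigmoid(t)): vanishes
grad_nonsat = s - 1.0           # d/dt [-log sigmoid(t)]: ≈ -1, strong signal
print(grad_original, grad_nonsat)

# At D(G(z)) = 0.5 (t = 0) the two gradients coincide, as stated above:
print(-sigmoid(0.0), sigmoid(0.0) - 1.0)  # -0.5 -0.5
```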
Numerical Example: Fitting a 1D Gaussian
To build intuition for the GAN training process, consider a simplified example: the real data follows a one-dimensional Gaussian \(p_{data} = \mathcal{N}(5, 1)\), and the generator learns this distribution starting from uniform noise \(z \sim \text{Uniform}(0, 1)\).
Suppose G is a simple linear transformation \(G(z) = az + b\), and D is a small neural network.
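This setup can be run end to end as a minimal NumPy sketch. The hyperparameters are illustrative, and central-difference numerical gradients stand in for backpropagation to keep the code self-contained; a real implementation would use an autodiff framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-np.clip(t, -60.0, 60.0)))

# Toy setup from the text: real data ~ N(5, 1), z ~ Uniform(0, 1),
# G(z) = a*z + b, and D(x) = sigmoid(w*x + c) as a minimal discriminator.
theta_g = np.array([1.0, 0.0])   # [a, b]
theta_d = np.array([0.0, 0.0])   # [w, c]

def G(z, tg):
    return tg[0] * z + tg[1]

def D(x, td):
    return sigmoid(td[0] * x + td[1])

def V(td, tg, x_real, z):
    # Mini-batch estimate of V(D, G); D performs gradient ascent on this.
    eps = 1e-8
    return (np.mean(np.log(D(x_real, td) + eps))
            + np.mean(np.log(1.0 - D(G(z, tg), td) + eps)))

def g_loss(tg, td, z):
    # Non-saturating generator loss: -log D(G(z))
    eps = 1e-8
    return -np.mean(np.log(D(G(z, tg), td) + eps))

def num_grad(f, p, h=1e-5):
    # Central-difference gradient (stands in for backpropagation here)
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2.0 * h)
    return g

lr, k = 0.05, 1
for step in range(3000):
    x_real = rng.normal(5.0, 1.0, size=64)
    for _ in range(k):                                       # Step 1: update D
        z = rng.uniform(0.0, 1.0, size=64)
        theta_d += lr * num_grad(lambda td: V(td, theta_g, x_real, z), theta_d)
    z = rng.uniform(0.0, 1.0, size=64)                       # Step 2: update G
    theta_g -= lr * num_grad(lambda tg: g_loss(tg, theta_d, z), theta_g)

mean_g = float(np.mean(G(rng.uniform(0.0, 1.0, 10000), theta_g)))
print(mean_g)  # drifts from the initial 0.5 toward the target mean of 5
```

A linear G of uniform noise cannot match the Gaussian's shape exactly, but its mean drifts toward 5, mirroring the qualitative trajectory in the table that follows.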
Visualizing the Training Process
| Stage | G's Output Distribution | D's Behavior | Description |
|---|---|---|---|
| Initial | \(G(z) \sim \text{Uniform}(0, 1)\) (far from target) | \(D\) easily distinguishes real from fake, \(D(x_{real}) \approx 1\), \(D(G(z)) \approx 0\) | G's samples look nothing like real data |
| Step 100 | Mean of \(G(z)\) distribution starts shifting toward 5 | \(D\) can still distinguish reasonably well, but accuracy drops | G has learned the approximate location |
| Step 500 | \(G(z) \approx \mathcal{N}(4.5, 0.8)\) | Discrimination becomes difficult, \(D(\cdot) \approx 0.6\text{--}0.7\) | G is approaching the target distribution |
| Converged | \(G(z) \approx \mathcal{N}(5, 1)\) | \(D(x) \approx 0.5\) (for all inputs) | Nash equilibrium reached |
Loss Trends
Loss
│
│ D_loss
│ ╲
│ ╲ ___________
│ ╲______╱ ─── → ln(2) ≈ 0.693
│
│ ╱‾‾‾‾‾‾‾‾‾‾‾‾‾
│ _____╱
│ ╱
│ ╱ G_loss
│
└──────────────────────────→ Training Steps
In the ideal case, when training reaches equilibrium:
- \(D_{loss} = -[\log(0.5) + \log(0.5)] = \log 4 \approx 1.386\) (averaged per sample: \(\log 2 \approx 0.693\))
- \(G_{loss} = -\log(0.5) = \log 2 \approx 0.693\)
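These equilibrium values follow directly from plugging \(D(\cdot) = 0.5\) into the losses:

```python
import numpy as np

# At equilibrium the discriminator outputs 0.5 for every sample.
d_real = d_fake = 0.5
d_loss = -(np.log(d_real) + np.log(1.0 - d_fake))  # = log 4
g_loss = -np.log(d_fake)                           # = log 2
print(d_loss, g_loss)  # ≈ 1.386, ≈ 0.693
```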
Training Difficulties and Solutions
GAN training is notoriously unstable. Below are several core challenges and corresponding strategies.
Mode Collapse
Problem: The generator learns to produce only a few types of samples while ignoring other modes of the data distribution. For example, on the MNIST dataset, G might only generate digits "1" and "7" while neglecting all other digits.
Cause: G discovers that certain samples are particularly effective at fooling D, and "takes the shortcut" of only generating those samples. From an optimization perspective, this occurs because G's objective function does not directly penalize lack of diversity.
Solutions:
- Minibatch Discrimination: Let D examine a batch of generated samples simultaneously and assess their diversity
- Unrolled GAN: Let G "look ahead" several steps of D's future updates during optimization
- WGAN: Replace JS divergence with the Wasserstein distance, fundamentally alleviating the problem
Training Instability
Problem: When D is too strong, G receives near-zero gradients (vanishing gradients); when D is too weak, the gradient signals G receives are meaningless. The two networks must maintain a delicate balance.
Analogy: It is like a student-teacher chess match — if the teacher is too strong, the student learns nothing (crushed at every move); if the teacher is too weak, the student also learns nothing (wins effortlessly every time).
Solutions:
- Carefully tune the learning rates and training step ratios for D and G
- Use Spectral Normalization to constrain D's Lipschitz constant
- Use the Two-Timescale Update Rule (TTUR): a larger learning rate for D and a smaller one for G
Vanishing Gradients
Problem: When D is trained too well, \(D(G(z)) \approx 0\), causing the original loss \(\log(1 - D(G(z))) \approx 0\), and G can barely receive gradient updates.
Mathematical Analysis: Under the optimal discriminator \(D^*\), when the supports of \(p_g\) and \(p_{data}\) do not overlap (which almost certainly happens in high-dimensional spaces), the JS divergence becomes a constant \(\log 2\), and the gradient is zero.
Solutions:
- Use the non-saturating loss
- Use the Wasserstein distance (which provides meaningful gradients even when distributions do not overlap)
- Add noise to the inputs (Instance Noise) to make the supports of the real and fake distributions overlap
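The constant-JS-divergence claim is easy to verify on discrete distributions with disjoint supports (the bin values below are illustrative):

```python
import numpy as np

def js_divergence(p, q):
    # Jensen-Shannon divergence between two discrete distributions
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two distributions on a shared 10-bin grid with disjoint supports:
p = np.array([0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5])
print(js_divergence(p, q), np.log(2.0))  # both ≈ 0.6931

# Moving q's support closer (but still disjoint) changes nothing: the
# divergence stays pinned at log 2, so it carries no gradient information.
q2 = np.roll(q, -2)  # mass now in bins 6 and 7, still disjoint from p
print(js_divergence(p, q2))  # still ≈ 0.6931
```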
Evaluation Difficulty
Problem: Unlike classification tasks with accuracy or regression tasks with MSE, GANs lack a single reliable metric for measuring generation quality. The training loss itself does not reflect generation quality — oscillating D and G losses do not necessarily indicate training failure.
Common Evaluation Metrics:
| Metric | What It Measures | Pros | Cons |
|---|---|---|---|
| IS (Inception Score) | Generation quality + diversity | Simple to compute | Does not compare against real data |
| FID (Fréchet Inception Distance) | Distance between generated and real distributions | Highly correlated with human judgment | Requires a large number of samples |
| Precision & Recall | Quality vs. coverage trade-off | Separates two dimensions | Computationally expensive |
Intuition Behind FID
FID extracts features from both real and generated images at an intermediate layer of the Inception network, models each set of features as a Gaussian distribution, and then computes the Fréchet distance between the two Gaussians. A lower FID indicates that the generated images are closer to the real images.
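For Gaussians with diagonal covariance, the Fréchet distance has a simple closed form, which the sketch below implements (the full FID uses complete covariance matrices and a matrix square root; the input values are illustrative):

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    # Fréchet distance between Gaussians with diagonal covariance:
    # ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1 * var2))
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2)))

# Identical feature distributions give distance 0 (a "perfect" FID)
print(frechet_distance_diag([0, 0], [1, 1], [0, 0], [1, 1]))  # 0.0
# Mean and variance mismatches both contribute to the distance
print(frechet_distance_diag([1, 0], [1, 1], [0, 0], [4, 1]))  # 2.0
```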
GAN Variants
After the basic GAN framework was introduced, a large number of improved variants emerged, advancing generative models along multiple fronts including architecture design, loss functions, and training strategies.
DCGAN (Deep Convolutional GAN, 2015)
DCGAN was the first work to successfully incorporate CNNs into GANs, proposing a set of architectural guidelines:
- Replace pooling layers with strided convolutions
- Use Batch Normalization in both G and D (except in D's input layer and G's output layer)
- Remove fully connected layers in favor of a fully convolutional architecture
- Use ReLU activations in G (with Tanh in the output layer) and LeakyReLU in D
Generator architecture (DCGAN):
z ∈ R^100 → FC → Reshape(4×4×1024) → ConvT(512) → ConvT(256) → ConvT(128) → ConvT(3) → 64×64×3 image
(BN + ReLU at every layer; Tanh at the output layer)
Discriminator architecture (DCGAN):
64×64×3 image → Conv(128) → Conv(256) → Conv(512) → Conv(1024) → FC → Sigmoid
(BN + LeakyReLU at every layer; no BN at the input layer)
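The spatial dimensions in the generator sketch above follow from the transposed-convolution output-size formula. Assuming the common DCGAN configuration of kernel 4, stride 2, padding 1 (PyTorch convention):

```python
def convt_out(size, kernel=4, stride=2, pad=1):
    # Transposed-convolution output size (PyTorch convention):
    # out = (in - 1) * stride - 2 * pad + kernel
    return (size - 1) * stride - 2 * pad + kernel

s = 4                 # spatial size after reshaping the FC output to 4×4×1024
for _ in range(4):    # four ConvT(k=4, s=2, p=1) upsampling stages
    s = convt_out(s)
print(s)  # 64, matching the 64×64×3 output image
```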
The significance of DCGAN lies in demonstrating that GANs can generate reasonably high-quality images and learn meaningful latent space representations (e.g., man with glasses - man + woman = woman with glasses).
WGAN (Wasserstein GAN, 2017)
WGAN was a major breakthrough in GAN training stability. Its core modification replaces JS divergence with the Wasserstein-1 distance (Earth Mover's Distance):
\[
W(p_{data}, p_g) = \inf_{\gamma \in \Pi(p_{data}, p_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\|x - y\|\right]
\]
where \(\Pi(p_{data}, p_g)\) is the set of all joint distributions whose marginals are \(p_{data}\) and \(p_g\).
Intuitively, the Wasserstein distance measures "the minimum amount of work required to move a pile of dirt from distribution \(p_g\) to distribution \(p_{data}\)."
Through the Kantorovich-Rubinstein duality, the WGAN objective becomes:
\[
\min_G \max_{\|D\|_L \le 1} \; \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]
\]
where \(\|D\|_L \le 1\) requires D to be 1-Lipschitz.
Key modifications:
- D (now called the Critic) no longer applies a Sigmoid to its output — instead of outputting a probability, it outputs a real-valued score
- Weight Clipping (clamping D's parameters to \([-c, c]\)) is used to approximately enforce the Lipschitz constraint
- Batch Normalization is removed from D (as it would violate the Lipschitz constraint)
- The RMSProp optimizer is used (momentum-based optimizers like Adam are avoided)
What Does WGAN Solve?
JS divergence equals a constant \(\log 2\) when the two distributions do not overlap, yielding zero gradients. The Wasserstein distance, by contrast, is continuous and differentiable even when distributions do not overlap, consistently providing meaningful gradient signals to G. Additionally, the Critic's loss can serve as an indicator of training progress — it correlates positively with generation quality.
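This contrast can be demonstrated in one dimension, where the empirical Wasserstein-1 distance between equal-size samples reduces to sorting (the sample sizes and shifts below are illustrative):

```python
import numpy as np

def w1_empirical(a, b):
    # Empirical Wasserstein-1 distance in 1-D: sorting both equal-size
    # samples yields the optimal transport plan, so W1 is just the mean
    # absolute difference between the sorted samples.
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 100_000)
dists = [w1_empirical(real, rng.normal(shift, 1.0, 100_000))
         for shift in (0.5, 2.0, 8.0)]
print(dists)  # grows smoothly with the shift: ≈ [0.5, 2.0, 8.0]
# JS divergence would instead saturate at log 2 once the supports
# effectively stop overlapping (e.g. shift = 8), giving G no gradient.
```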
WGAN-GP (WGAN with Gradient Penalty, 2017)
The weight clipping in WGAN is a crude approach that can cause weights to concentrate at the clipping boundaries \(\{-c, c\}\), limiting the model's expressive capacity.
WGAN-GP proposes replacing weight clipping with a gradient penalty:
\[
L = \mathbb{E}_{z \sim p_z}[D(G(z))] - \mathbb{E}_{x \sim p_{data}}[D(x)] + \lambda \, \mathbb{E}_{\hat{x}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right]
\]
Here \(\hat{x}\) is a random interpolation between real and generated samples, and \(\lambda\) is typically set to 10.
The gradient penalty enforces the 1-Lipschitz constraint across the entire space (keeping gradient norms close to 1), resulting in more stable training and better performance.
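The penalty term can be sketched in NumPy by using a linear critic, whose input gradient is available in closed form (a real implementation obtains \(\nabla_{\hat{x}} D(\hat{x})\) via autodiff; the weights below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear critic f(x) = w @ x, so the input gradient is w everywhere.
w = np.array([3.0, -4.0])                  # illustrative weights, ||w|| = 5
x_real = rng.normal(5.0, 1.0, size=(8, 2))
x_fake = rng.normal(0.0, 1.0, size=(8, 2))

eps = rng.uniform(0.0, 1.0, size=(8, 1))
x_hat = eps * x_real + (1.0 - eps) * x_fake   # random interpolates

grad = np.broadcast_to(w, x_hat.shape)        # ∇_x̂ f(x̂) = w for every x̂
grad_norm = np.linalg.norm(grad, axis=1)      # = 5.0 for every sample
lam = 10.0                                    # the usual λ = 10
penalty = lam * np.mean((grad_norm - 1.0) ** 2)
print(penalty)  # 10 * (5 - 1)^2 = 160.0
```

The penalty pulls the critic's gradient norms toward 1 over the interpolates, approximating the 1-Lipschitz constraint without clipping.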
Conditional GAN (cGAN, 2014)
Conditional GANs introduce additional conditioning information \(y\) (such as class labels) into both G and D:
\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z \mid y) \mid y))]
\]
This enables control over the generated content. For example, given the condition \(y = \text{"digit 5"}\), G will specifically generate images of the digit 5.
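A minimal sketch of one common conditioning mechanism: concatenating a one-hot label to the noise vector before it enters G (dimensions are illustrative; many implementations instead use label embeddings or projection discriminators):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, z_dim = 10, 100
z = rng.normal(size=(4, z_dim))              # a batch of 4 noise vectors
labels = np.array([5, 5, 5, 5])              # condition: generate "digit 5"
y = np.eye(n_classes)[labels]                # one-hot labels
g_input = np.concatenate([z, y], axis=1)     # G receives both z and y
print(g_input.shape)  # (4, 110)
```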
Pix2Pix (2016)
Pix2Pix applies cGAN to paired image-to-image translation tasks (e.g., semantic segmentation maps \(\to\) photographs, line drawings \(\to\) color images).
- G uses a U-Net architecture (an encoder-decoder with skip connections)
- D uses PatchGAN: instead of producing a single real/fake judgment for the entire image, it independently classifies each local patch as real or fake
- Loss = cGAN loss + L1 reconstruction loss
CycleGAN (2017)
CycleGAN addresses unpaired image translation (e.g., horses \(\leftrightarrow\) zebras, summer \(\leftrightarrow\) winter).
The core idea is the cycle consistency loss:
\[
L_{cyc}(G, F) = \mathbb{E}_{x}\left[\|F(G(x)) - x\|_1\right] + \mathbb{E}_{y}\left[\|G(F(y)) - y\|_1\right]
\]
Here G translates images from domain A to domain B, and F translates from domain B back to domain A. Cycle consistency requires that translating an image to the other domain and back should recover the original image.
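The cycle-consistency idea can be sketched with toy one-dimensional "translators" that happen to be exact inverses, in which case the loss is zero (G and F below are illustrative stand-ins for the two generator networks):

```python
import numpy as np

# Toy 1-D "translators": G maps domain A to B, F maps B back to A.
# Here F is G's exact inverse, so the cycle loss is zero.
G = lambda x: 2.0 * x + 1.0
F = lambda y: (y - 1.0) / 2.0

x = np.linspace(-1.0, 1.0, 5)   # stand-ins for images from domain A
y = np.linspace(0.0, 3.0, 5)    # stand-ins for images from domain B

l_cyc = float(np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y)))
print(l_cyc)  # 0.0: perfect reconstruction in both directions
```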
StyleGAN (2018-2021)
The StyleGAN series (StyleGAN, StyleGAN2, StyleGAN3) achieved breakthrough results in high-quality face generation.
Core innovations:
- Mapping network: \(z \to w\), first mapping noise through an 8-layer MLP to an intermediate latent space \(\mathcal{W}\)
- Adaptive Instance Normalization (AdaIN): Using \(w\) to control the generation style at each layer
- Noise injection: Injecting random noise at each layer to control fine details (e.g., hair texture, skin pores)
- Progressive training (StyleGAN1) / skip connections (StyleGAN2)
StyleGAN2 generates face images at 1024x1024 resolution that are virtually indistinguishable from real photographs to the human eye.
Progressive GAN (2017)
Progressive GAN introduced progressive growing training:
Training stages: 4×4 → 8×8 → 16×16 → 32×32 → ... → 1024×1024
Training begins at low resolution and progressively adds layers and increases resolution. This approach makes training more stable because:
- The low-resolution stages quickly learn global structure
- The high-resolution stages only need to learn fine details
- It avoids the difficulty of optimizing in high-dimensional space from the start
GAN vs. VAE vs. Diffusion Comparison
| Dimension | GAN | VAE | Diffusion |
|---|---|---|---|
| Modeling Approach | Implicit density (adversarial training) | Explicit density (variational inference) | Explicit density (denoising score matching) |
| Generation Quality | High (sharp, realistic) | Medium (often blurry) | Very high (rich in detail) |
| Training Stability | Poor (requires careful tuning) | Good (stable convergence) | Good (simple MSE loss) |
| Mode Coverage | Poor (mode collapse) | Good (likelihood optimization) | Very good (likelihood optimization) |
| Generation Speed | Fast (single forward pass) | Fast (single forward pass) | Slow (requires many iterative denoising steps) |
| Controllability | Medium (requires conditioning mechanisms) | Good (explicit latent space) | Good (Classifier-free Guidance) |
| Likelihood Computation | Not computable | Lower bound computable (ELBO) | Lower bound computable |
| Typical Applications | Image super-resolution, style transfer | Representation learning, anomaly detection | Text-to-image generation |
| Representative Models | StyleGAN, BigGAN | VQ-VAE, DALL-E 1 | Stable Diffusion, DALL-E 2/3 |
One-Sentence Summary
GANs generate fast and sharp results but are difficult to train; VAEs train stably but produce blurry outputs; Diffusion models achieve the highest quality but are the slowest to generate.
Reflections and Discussion
Why Has Diffusion Replaced GANs?
Starting with DDPM in 2020, diffusion models have progressively displaced GANs as the dominant approach in image generation. The main reasons include:
- Training stability: The training objective of diffusion models is a simple MSE denoising loss, free from the balancing act between G and D inherent in adversarial training. Anyone can achieve good results with a standard training pipeline, whereas GAN training requires numerous tricks and experience.
- Mode coverage: Diffusion models optimize the log-likelihood (or its lower bound), which naturally encourages coverage of all modes in the data distribution. The mode collapse problem in GANs has never been perfectly resolved.
- Scalability: Diffusion models perform better with large-scale data and large models, and their scaling behavior is more predictable (analogous to the Scaling Laws observed in LLMs).
- Controllable generation: Techniques such as Classifier-free Guidance make conditional generation with diffusion models highly flexible, driving the success of products like DALL-E 2, Stable Diffusion, and Midjourney.
Where Are GANs Still Important?
Despite yielding the "main stage" of image generation to diffusion models, GANs retain irreplaceable advantages in several scenarios:
- Real-time applications: GANs require only a single forward pass to generate, which is far faster than diffusion models that need tens to hundreds of denoising steps. GANs remain the first choice in video games, real-time style transfer, and mobile applications.
- Image super-resolution: Models like ESRGAN remain the mainstream approach for super-resolution to this day.
- Image editing and manipulation: GAN Inversion techniques can map real images back into the latent space for editing (e.g., changing age, expression, hairstyle).
- Data augmentation: Using GANs to generate synthetic data for training set expansion is particularly valuable in data-scarce domains such as medical imaging.
- Accelerating diffusion models: When distilling diffusion models into single-step generators, adversarial training is often employed (e.g., Adversarial Diffusion Distillation in SDXL-Turbo).
The Broader Impact of Adversarial Training
The adversarial training paradigm introduced by GANs extends far beyond image generation, profoundly influencing multiple areas of machine learning:
- Adversarial examples and robustness: Adversarial training is a core method for improving model robustness
- Domain adaptation: Learning domain-invariant features through adversarial training (e.g., DANN)
- Text generation: SeqGAN and related work brought GAN concepts to discrete sequence generation (though with limited success)
- Reinforcement learning: GAIL (Generative Adversarial Imitation Learning) uses adversarial training for imitation learning
- Fairness: Adversarial training is used to remove sensitive attribute information from models
- Privacy protection: Adversarial training has been applied in differential privacy and federated learning
Historical Significance
GANs may no longer be the optimal approach for image generation, but their contribution to the field of deep learning is lasting: they proved the viability of implicit density modeling, pioneered the adversarial training paradigm, and inspired countless subsequent works. As Goodfellow himself has noted, the most important contribution of GANs is not any specific model, but an entirely new training methodology.
References
- Goodfellow, I., et al. (2014). Generative Adversarial Nets. NeurIPS.
- Radford, A., et al. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (DCGAN). ICLR.
- Arjovsky, M., et al. (2017). Wasserstein GAN. ICML.
- Gulrajani, I., et al. (2017). Improved Training of Wasserstein GANs (WGAN-GP). NeurIPS.
- Mirza, M. & Osindero, S. (2014). Conditional Generative Adversarial Nets. arXiv.
- Isola, P., et al. (2016). Image-to-Image Translation with Conditional Adversarial Networks (Pix2Pix). CVPR.
- Zhu, J.-Y., et al. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (CycleGAN). ICCV.
- Karras, T., et al. (2018). Progressive Growing of GANs. ICLR.
- Karras, T., et al. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks (StyleGAN). CVPR.
- Karras, T., et al. (2020). Analyzing and Improving the Image Quality of StyleGAN (StyleGAN2). CVPR.