
Variational Autoencoder (VAE)

In 2013, Kingma and Welling proposed the Variational Autoencoder (VAE) in their paper "Auto-Encoding Variational Bayes". VAE combines deep learning with Bayesian inference, introducing probabilistic modeling into the autoencoder framework so that the latent space becomes continuous and samplable. This made it one of the first deep generative models to combine tractable approximate inference with the ability to generate new samples.


1. Background and Motivation

1.1 The Goal of Generative Models

The core objective of generative models is to learn the true data distribution \(p(x)\) and sample from it to generate new, plausible data. Given a set of training data \(\{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}\), we aim to find a parameterized model \(p_\theta(x)\) that approximates the true data distribution as closely as possible.

1.2 Autoencoder Recap

A traditional autoencoder (AE) consists of two components — an Encoder and a Decoder:

  • Encoder \(f_\phi\): maps the input \(x\) to a low-dimensional latent representation \(z = f_\phi(x)\)
  • Decoder \(g_\theta\): reconstructs the input from the latent representation as \(\hat{x} = g_\theta(z)\)

The training objective is to minimize the reconstruction error, e.g., \(\|x - \hat{x}\|^2\).
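
As a reference point, a minimal PyTorch sketch of such an autoencoder could look as follows (the layer sizes, activations, and 784-dimensional input are illustrative assumptions, e.g. flattened 28×28 images):

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        # Encoder f_phi: maps x to a single deterministic latent vector z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim))
        # Decoder g_theta: reconstructs x from z
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)          # deterministic code, no distribution
        return self.decoder(z)       # x_hat

# Training minimizes the reconstruction error, e.g.
# loss = ((x - model(x)) ** 2).sum(dim=1).mean()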

Limitations of Traditional AE

The latent space of a traditional AE is irregular: different training samples are mapped to scattered points in latent space, and the regions between these points carry no semantic meaning. Consequently, randomly sampling a point in the latent space and passing it through the Decoder will not produce meaningful output — AE is not a proper generative model.

1.3 The Core Innovation of VAE

The key idea behind VAE is: instead of having the Encoder output a deterministic point, it outputs the parameters of a probability distribution. Specifically, for each input \(x\), the Encoder outputs a Gaussian distribution \(q_\phi(z|x) = \mathcal{N}(\mu, \sigma^2 I)\), and \(z\) is then sampled from this distribution. Meanwhile, a KL divergence regularization term forces these distributions to stay close to the standard normal distribution \(\mathcal{N}(0, I)\).

The result is a latent space that is continuous, smooth, and samplable. To generate new data, one simply samples \(z\) from \(\mathcal{N}(0, I)\) and passes it through the Decoder.


2. VAE Architecture in Detail

2.1 Overall Architecture

                         Encoder                              Decoder
                    ┌───────────────┐                    ┌───────────────┐
                    │               │──→ μ ──┐           │               │
   Input x ──────→ │  Neural Net   │        ├─→ z ────→ │  Neural Net   │ ──→ x̂
                    │               │──→ σ ──┘           │               │
                    └───────────────┘    ↑               └───────────────┘
                                        │
                                   ε ~ N(0, I)
                                (Reparameterization)

                    z = μ + σ ⊙ ε

The entire process can be divided into three stages:

  1. Encoding: The Encoder network takes input \(x\) and outputs the parameters of the latent variable distribution — \(\mu\) and \(\sigma\) (or \(\log \sigma^2\))
  2. Reparameterization Sampling: Sample \(z\) from \(\mathcal{N}(\mu, \sigma^2 I)\)
  3. Decoding: The Decoder network takes \(z\) and outputs the reconstructed data \(\hat{x}\)

2.2 Encoder: From Deterministic to Probabilistic

A traditional AE Encoder outputs a deterministic vector \(z\). The VAE Encoder instead outputs two vectors:

  • \(\mu = f_\mu(x)\): the mean of the latent variable distribution
  • \(\log \sigma^2 = f_\sigma(x)\): the log-variance of the latent variable distribution (in practice, outputting \(\log \sigma^2\) rather than \(\sigma\) is preferred because its range spans the entire real line, making it numerically more stable)

This defines a conditional distribution:

\[ q_\phi(z|x) = \mathcal{N}(z; \mu(x), \sigma^2(x) I) \]
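
A minimal sketch of such a probabilistic Encoder in PyTorch (the hidden layer size and dimensions are assumptions for illustration):

import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)        # outputs mu(x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)    # outputs log sigma^2(x)

    def forward(self, x):
        h = self.hidden(x)
        return self.fc_mu(h), self.fc_logvar(h)   # parameters of q_phi(z|x)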

2.3 Reparameterization Trick

Problem: We need to sample \(z\) from \(q_\phi(z|x) = \mathcal{N}(\mu, \sigma^2 I)\), but sampling is a stochastic operation through which gradients cannot flow, so backpropagation cannot carry gradients back to the Encoder parameters \(\phi\).

Solution: Separate the randomness from the computational graph. The procedure is as follows:

  1. Sample noise from the standard normal distribution: \(\epsilon \sim \mathcal{N}(0, I)\)
  2. Obtain \(z\) through a deterministic transformation:
\[ z = \mu + \sigma \odot \epsilon \]

where \(\odot\) denotes element-wise multiplication.

The Essence of Reparameterization

The key insight of this trick is that \(z\) is now a deterministic function of \(\mu\) and \(\sigma\) (given \(\epsilon\)), so gradients can flow through \(z\) back to \(\mu\) and \(\sigma\), and then back to the Encoder parameters \(\phi\). The randomness is "outsourced" to \(\epsilon\), which does not depend on any learnable parameters.

\[ \frac{\partial z}{\partial \mu} = 1, \quad \frac{\partial z}{\partial \sigma} = \epsilon \]
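
In code, the trick is only a few lines (a sketch; torch.randn_like draws \(\epsilon\) from a standard normal of the same shape as its argument):

import torch

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)    # sigma = exp(log sigma^2 / 2)
    eps = torch.randn_like(std)      # eps ~ N(0, I), carries all the randomness
    return mu + std * eps            # z is a differentiable function of mu and sigma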

2.4 Decoder

The Decoder receives the latent variable \(z\) and outputs the reconstructed data \(\hat{x}\). From a probabilistic perspective, the Decoder defines the conditional distribution \(p_\theta(x|z)\):

  • Continuous data (e.g., grayscale image pixel values normalized to \([0,1]\)): assume \(p_\theta(x|z) = \mathcal{N}(x; \hat{x}, I)\), in which case maximizing the log-likelihood is equivalent to minimizing MSE
  • Binary data (e.g., binarized MNIST): assume \(p_\theta(x|z) = \text{Bernoulli}(x; \hat{x})\), in which case maximizing the log-likelihood is equivalent to minimizing binary cross-entropy (BCE)
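
For binary (or \([0,1]\)-normalized) data, the Decoder's output layer typically ends in a Sigmoid so that \(\hat{x}\) can be read as per-pixel Bernoulli means. A minimal sketch (sizes are illustrative assumptions):

import torch.nn as nn

class BernoulliDecoder(nn.Module):
    def __init__(self, latent_dim=16, hidden_dim=256, output_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, output_dim), nn.Sigmoid())   # values in (0, 1)

    def forward(self, z):
        return self.net(z)   # x_hat: the means of p_theta(x|z)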

3. Mathematical Derivation

This is the theoretical core of VAE.

3.1 Objective: Maximizing the Marginal Likelihood

Our goal is to learn parameters \(\theta\) that maximize the marginal log-likelihood of the observed data:

\[ \log p_\theta(x) = \log \int p_\theta(x|z) \, p(z) \, dz \]

where \(p(z) = \mathcal{N}(0, I)\) is the prior distribution over the latent variables.

Intractable Integral

The integral above requires integrating over all possible values of \(z\). In high-dimensional spaces, this integral has no closed-form solution and cannot be efficiently estimated via Monte Carlo sampling (since most values of \(z\) contribute negligibly to \(p_\theta(x|z)\)).

3.2 Introducing Variational Inference

Since the true posterior \(p_\theta(z|x)\) is intractable, we introduce a parameterized approximate posterior \(q_\phi(z|x)\) to approximate it. We now derive the ELBO (Evidence Lower Bound).

Step 1: Write out the marginal log-likelihood

\[ \log p_\theta(x) = \log \int p_\theta(x, z) \, dz \]

Step 2: Introduce \(q_\phi(z|x)\) using the idea of importance sampling

\[ \log p_\theta(x) = \log \int q_\phi(z|x) \frac{p_\theta(x, z)}{q_\phi(z|x)} dz = \log \, \mathbb{E}_{q_\phi(z|x)} \left[ \frac{p_\theta(x, z)}{q_\phi(z|x)} \right] \]

Step 3: Apply Jensen's inequality (\(\log\) is concave, so \(\log \mathbb{E}[\cdot] \geq \mathbb{E}[\log(\cdot)]\))

\[ \log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)} \left[ \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \right] \]

This lower bound is the ELBO:

\[ \text{ELBO}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z|x)} \left[ \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \right] \]

3.3 Decomposing the ELBO

Factoring the joint distribution as \(p_\theta(x, z) = p_\theta(x|z) \, p(z)\):

\[ \text{ELBO} = \mathbb{E}_{q_\phi(z|x)} \left[ \log \frac{p_\theta(x|z) \, p(z)}{q_\phi(z|x)} \right] \]
\[ = \mathbb{E}_{q_\phi(z|x)} \left[ \log p_\theta(x|z) \right] + \mathbb{E}_{q_\phi(z|x)} \left[ \log \frac{p(z)}{q_\phi(z|x)} \right] \]
\[ = \underbrace{\mathbb{E}_{q_\phi(z|x)} \left[ \log p_\theta(x|z) \right]}_{\text{Reconstruction term}} - \underbrace{D_{KL}\left( q_\phi(z|x) \,\|\, p(z) \right)}_{\text{KL divergence regularization}} \]

Intuitive Interpretation of the Two ELBO Terms

  • Reconstruction term: Encourages the Decoder to reconstruct the original input \(x\) from \(z\) as accurately as possible. A larger value indicates better reconstruction.
  • KL divergence term: Penalizes the discrepancy between the approximate posterior \(q_\phi(z|x)\) and the prior \(p(z) = \mathcal{N}(0, I)\). This term forces the latent space to remain well-structured, ensuring that sampling from the prior produces meaningful outputs.

3.4 Relationship Between ELBO and Marginal Likelihood

An alternative perspective comes from the following decomposition of the marginal log-likelihood:

\[ \log p_\theta(x) = \text{ELBO}(\phi, \theta; x) + D_{KL}\left( q_\phi(z|x) \,\|\, p_\theta(z|x) \right) \]

Since KL divergence is always non-negative, we have \(\text{ELBO} \leq \log p_\theta(x)\), confirming that the ELBO is indeed a lower bound on the marginal log-likelihood. Maximizing the ELBO simultaneously accomplishes two things:

  1. Maximizes the marginal likelihood \(\log p_\theta(x)\) (improving the model's fit to the data)
  2. Minimizes \(D_{KL}(q_\phi(z|x) \| p_\theta(z|x))\) (making the approximate posterior closer to the true posterior)

3.5 Concrete Form of the Loss Function

The VAE loss function is the negative ELBO:

\[ \mathcal{L}(\phi, \theta; x) = -\text{ELBO} = \underbrace{-\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{Reconstruction loss}} + \underbrace{D_{KL}(q_\phi(z|x) \| p(z))}_{\text{KL loss}} \]

Reconstruction Loss

In practice, the expectation in the reconstruction term is estimated via single-sample (\(L=1\)) Monte Carlo:

  • MSE loss (Gaussian Decoder): \(\mathcal{L}_{\text{recon}} = \|x - \hat{x}\|^2\)
  • BCE loss (Bernoulli Decoder): \(\mathcal{L}_{\text{recon}} = -\sum_i [x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i)]\)
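
With PyTorch's built-in losses, the two cases can be written as follows (a sketch; x and x_hat are batch tensors of shape (batch, d), and the sum-then-average reduction matches the per-example losses above):

import torch.nn.functional as F

def reconstruction_loss(x_hat, x, likelihood="bernoulli"):
    if likelihood == "gaussian":
        # Gaussian decoder: squared error, summed over data dimensions
        return F.mse_loss(x_hat, x, reduction="sum") / x.size(0)
    # Bernoulli decoder: binary cross-entropy, x_hat must lie in (0, 1)
    return F.binary_cross_entropy(x_hat, x, reduction="sum") / x.size(0)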

Closed-Form KL Divergence

When \(q_\phi(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))\) and \(p(z) = \mathcal{N}(0, I)\), the KL divergence admits the following closed-form solution (assuming the latent space has dimensionality \(J\)):

\[ D_{KL}(q_\phi(z|x) \| p(z)) = -\frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right) \]

Derivation:

The general formula for KL divergence between two Gaussian distributions is:

\[ D_{KL}(\mathcal{N}(\mu_1, \Sigma_1) \| \mathcal{N}(\mu_2, \Sigma_2)) = \frac{1}{2} \left[ \text{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) - k + \log \frac{|\Sigma_2|}{|\Sigma_1|} \right] \]

Substituting \(\mu_1 = \mu\), \(\Sigma_1 = \text{diag}(\sigma^2)\), \(\mu_2 = 0\), \(\Sigma_2 = I\):

\[ D_{KL} = \frac{1}{2} \left[ \text{tr}(\text{diag}(\sigma^2)) + \mu^\top \mu - J + \log \frac{1}{\prod_j \sigma_j^2} \right] \]
\[ = \frac{1}{2} \left[ \sum_j \sigma_j^2 + \sum_j \mu_j^2 - J - \sum_j \log \sigma_j^2 \right] \]
\[ = -\frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right) \]
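
In code, with the Encoder outputting mu and logvar \(= \log \sigma^2\), this closed form becomes (a sketch; summed over latent dimensions, averaged over the batch):

import torch

def kl_divergence(mu, logvar):
    # D_KL( N(mu, diag(sigma^2)) || N(0, I) ) per example, then batch mean
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    return kl.mean()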

4. Forward Pass Numerical Example

To deepen understanding, let us walk through a minimal example of a VAE forward pass.

Setup: input dimensionality \(d = 4\), latent space dimensionality \(J = 2\).

Step 1: Encoding

Suppose the input is \(x = [0.8, 0.3, 0.9, 0.5]\). After passing through the Encoder network, we obtain:

\[ \mu = [0.5, -0.3], \quad \log \sigma^2 = [-1.0, -0.5] \]

From which we can compute:

\[ \sigma^2 = [e^{-1.0}, e^{-0.5}] = [0.368, 0.607], \quad \sigma = [0.607, 0.779] \]

Step 2: Reparameterization Sampling

Sample \(\epsilon = [0.42, -0.15]\) from the standard normal distribution, then:

\[ z = \mu + \sigma \odot \epsilon = [0.5 + 0.607 \times 0.42, \; -0.3 + 0.779 \times (-0.15)] \]
\[ z = [0.755, -0.417] \]

Step 3: Decoding

Feed \(z = [0.755, -0.417]\) into the Decoder network. Suppose the output is:

\[ \hat{x} = [0.72, 0.35, 0.85, 0.48] \]

Step 4: Computing the Loss

Reconstruction loss (MSE):

\[ \mathcal{L}_{\text{recon}} = \|x - \hat{x}\|^2 = (0.8-0.72)^2 + (0.3-0.35)^2 + (0.9-0.85)^2 + (0.5-0.48)^2 \]
\[ = 0.0064 + 0.0025 + 0.0025 + 0.0004 = 0.0118 \]

KL loss:

\[ D_{KL} = -\frac{1}{2} \sum_{j=1}^{2} (1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2) \]
\[ = -\frac{1}{2} \left[ (1 + (-1.0) - 0.25 - 0.368) + (1 + (-0.5) - 0.09 - 0.607) \right] \]
\[ = -\frac{1}{2} \left[ -0.618 + (-0.197) \right] = -\frac{1}{2} \times (-0.815) = 0.408 \]

Total loss:

\[ \mathcal{L} = \mathcal{L}_{\text{recon}} + D_{KL} = 0.0118 + 0.408 = 0.420 \]
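
The arithmetic above can be reproduced with a few lines of NumPy (a sketch of this worked example only; the Decoder output \(\hat{x}\) is simply assumed rather than computed by a network):

import numpy as np

x      = np.array([0.8, 0.3, 0.9, 0.5])
mu     = np.array([0.5, -0.3])
logvar = np.array([-1.0, -0.5])
eps    = np.array([0.42, -0.15])

sigma = np.exp(0.5 * logvar)                 # [0.607, 0.779]
z = mu + sigma * eps                         # [0.755, -0.417]
x_hat = np.array([0.72, 0.35, 0.85, 0.48])   # assumed Decoder output

recon = np.sum((x - x_hat) ** 2)                            # 0.0118
kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))     # ≈ 0.408
print(recon, kl, recon + kl)                                # total ≈ 0.420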

Observation

In this toy example, the KL term dominates the total loss: the reconstruction is already quite accurate, while \(\mu\) and \(\log \sigma^2\) are still far from the prior. In general, the relative size of the two terms depends on the data dimensionality and the choice of decoder likelihood; what matters during training is that the two terms pull in opposite directions, and optimization gradually drives them toward a balance.


5. Training Process and Techniques

5.1 Basic Training Loop

for each epoch:
    for each batch x:
        1. μ, log σ² = Encoder(x)
        2. ε ~ N(0, I)
        3. z = μ + σ ⊙ ε
        4. x̂ = Decoder(z)
        5. L_recon = reconstruction_loss(x, x̂)
        6. L_KL = -0.5 * sum(1 + log σ² - μ² - σ²)
        7. L = L_recon + L_KL
        8. Backpropagate and update φ and θ
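
A compact, runnable PyTorch version of this loop is sketched below (the architecture sizes, optimizer, Bernoulli likelihood, and the random stand-in data are illustrative assumptions; in practice x would come from a DataLoader):

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)              # step 1
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # steps 2-3
        return self.dec(z), mu, logvar                             # step 4

model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    x = torch.rand(64, 784)                # stand-in for a data batch
    x_hat, mu, logvar = model(x)
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum") / x.size(0)         # step 5
    kl = (-0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)).mean()  # step 6
    loss = recon + kl                      # step 7
    optimizer.zero_grad()
    loss.backward()                        # step 8
    optimizer.step()

After training, new data are generated exactly as described in Section 1.3: draw \(z \sim \mathcal{N}(0, I)\) and decode it, e.g. samples = model.dec(torch.randn(16, 16)).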

5.2 Posterior Collapse

Phenomenon: During training, the Decoder learns to ignore the latent variable \(z\) and relies solely on its own capacity (e.g., a powerful autoregressive structure) to generate data. In this case, \(q_\phi(z|x)\) degenerates to the prior \(p(z) = \mathcal{N}(0, I)\), the KL divergence drops to zero, and the latent variable becomes "dead code."

Cause: The KL regularization term dominates too early in training, forcing the Encoder to collapse toward the prior before it has learned useful encodings, leaving the Decoder to handle reconstruction entirely on its own.

5.3 KL Annealing

Strategy: Assign a small weight \(\beta\) to the KL term at the beginning of training, and gradually increase it to 1:

\[ \mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \cdot D_{KL} \]
  • Linear annealing: \(\beta\) increases linearly from 0 to 1 (e.g., over the first 10 epochs)
  • Cyclical annealing: \(\beta\) periodically increases from 0 to 1, restarting at each cycle

This allows the Encoder sufficient freedom to learn meaningful encodings early on, with the latent space being gradually regularized afterward.
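
Both schedules can be expressed as a weight that depends on the training step (a sketch; the warm-up length and cycle length are arbitrary hyperparameters):

def linear_beta(step, warmup_steps=10_000):
    # beta rises linearly from 0 to 1, then stays at 1
    return min(1.0, step / warmup_steps)

def cyclical_beta(step, cycle_steps=10_000, ramp_fraction=0.5):
    # within each cycle, beta ramps from 0 to 1 and is then held at 1
    pos = (step % cycle_steps) / cycle_steps
    return min(1.0, pos / ramp_fraction)

# In the training loop:  loss = recon + beta(step) * kl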

5.4 Beta-VAE

Higgins et al. (2017) proposed Beta-VAE, which fixes the KL weight at \(\beta > 1\):

\[ \mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \cdot D_{KL}, \quad \beta > 1 \]

A larger \(\beta\) forces the latent space to be more compact and better aligned with the prior, which encourages learning disentangled representations — where each dimension of the latent space independently encodes a single semantic factor of the data (e.g., facial expression, pose, hairstyle).

The Beta-VAE Trade-off

The larger \(\beta\) is, the greater the degree of disentanglement, but reconstruction quality degrades (since the relative weight of the reconstruction term decreases). In practice, \(\beta\) must be tuned according to the task requirements.

5.5 Other Training Techniques

  • Free Bits: Set a minimum value for the KL divergence of each latent dimension so that individual dimensions cannot collapse entirely (see the sketch after this list)
  • Learning Rate Warm-up: Use a smaller learning rate early in training to stabilize the latent space structure
  • Output \(\log \sigma^2\) instead of \(\sigma\): the network output is then unconstrained over the real line, and \(\sigma = \exp(\tfrac{1}{2}\log \sigma^2)\) is recovered only where needed, which is numerically more stable than predicting \(\sigma\) directly
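
For example, the Free Bits idea can be sketched by clamping the per-dimension KL from below before summing (a simplified per-example variant; the original formulation clamps the KL averaged over the minibatch, and the threshold is a hyperparameter):

import torch

def kl_with_free_bits(mu, logvar, free_bits=0.5):
    kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())   # shape: (batch, J)
    kl_per_dim = torch.clamp(kl_per_dim, min=free_bits)           # each dim contributes at least free_bits nats
    return kl_per_dim.sum(dim=1).mean()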

6. Important VAE Variants

6.1 Conditional VAE (CVAE)

CVAE introduces a conditioning variable \(c\) (e.g., a class label) into both the encoding and decoding processes:

\[ q_\phi(z|x, c), \quad p_\theta(x|z, c) \]

The ELBO becomes:

\[ \text{ELBO} = \mathbb{E}_{q_\phi(z|x,c)}[\log p_\theta(x|z,c)] - D_{KL}(q_\phi(z|x,c) \| p(z|c)) \]

By specifying different values of \(c\) at generation time, one can control the class or attributes of the generated data. For example, after training a CVAE on MNIST, setting \(c = 7\) will specifically generate the digit "7."
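
A common way to implement the conditioning is to concatenate a one-hot encoding of \(c\) to both the Encoder input and the latent code before decoding (a sketch; the architecture and sizes are assumptions, and the prior is simplified to \(p(z|c) = \mathcal{N}(0, I)\), independent of \(c\)):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    def __init__(self, input_dim=784, num_classes=10, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.num_classes = num_classes
        self.enc = nn.Sequential(nn.Linear(input_dim + num_classes, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim + num_classes, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, input_dim), nn.Sigmoid())

    def forward(self, x, c):
        c_onehot = F.one_hot(c, self.num_classes).float()
        h = self.enc(torch.cat([x, c_onehot], dim=1))              # q_phi(z | x, c)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.dec(torch.cat([z, c_onehot], dim=1))          # p_theta(x | z, c)
        return x_hat, mu, logvar

At generation time, one samples \(z \sim \mathcal{N}(0, I)\), picks the desired label \(c\), and calls the decoder on their concatenation.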

6.2 VQ-VAE (Vector Quantized VAE)

VQ-VAE, proposed by van den Oord et al. (2017), replaces VAE's continuous Gaussian latent space with a discrete latent space:

  • A learnable codebook \(\{e_1, e_2, \dots, e_K\}\) containing \(K\) embedding vectors is maintained
  • The Encoder outputs a continuous vector \(z_e\), which is quantized to the nearest codebook embedding via nearest-neighbor lookup:
\[ z_q = e_k, \quad k = \arg\min_j \|z_e - e_j\| \]
  • The straight-through estimator is used to handle the non-differentiability of the quantization operation
  • Instead of KL divergence, codebook loss and commitment loss are used
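
A sketch of the quantization step with the straight-through gradient (the codebook size, embedding dimension, and commitment weight are illustrative assumptions; a full VQ-VAE would wrap this between an Encoder and a Decoder):

import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_embeddings=512, embedding_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_embeddings, embedding_dim)
        self.beta = beta

    def forward(self, z_e):                                  # z_e: (batch, embedding_dim)
        # Nearest-neighbor lookup: quantize each z_e to the closest codebook vector
        dist = torch.cdist(z_e, self.codebook.weight)        # (batch, num_embeddings)
        idx = dist.argmin(dim=1)
        z_q = self.codebook(idx)

        # Codebook loss pulls embeddings toward encoder outputs;
        # commitment loss keeps encoder outputs close to their chosen embeddings
        codebook_loss = F.mse_loss(z_q, z_e.detach())
        commitment_loss = self.beta * F.mse_loss(z_e, z_q.detach())

        # Straight-through estimator: in the backward pass, gradients flow
        # from z_q directly to z_e, bypassing the non-differentiable argmin
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codebook_loss + commitment_loss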

Significance of VQ-VAE

VQ-VAE serves as a foundational component in many modern generative models. DALL-E (OpenAI, 2021) uses dVAE (a discrete VAE closely related to VQ-VAE) to encode images into discrete tokens, which are then modeled by a Transformer. Stable Diffusion runs the diffusion process in the latent space of a VQ-VAE / KL-VAE, dramatically reducing computational cost.

6.3 VAE-GAN

VAE-GAN combines the strengths of VAE and GAN:

  • The VAE component provides stable training and a meaningful latent space
  • The GAN discriminator provides perceptual-level loss signals, replacing pixel-wise reconstruction loss

This hybrid architecture mitigates the blurriness of VAE-generated images while preserving the continuity and interpretability of the latent space.

6.4 Overview of Other Variants

Variant     | Core Idea                                                           | Year
Ladder VAE  | Multi-level latent variables with bottom-up inference               | 2016
IAF-VAE     | Inverse autoregressive flow for more flexible posteriors            | 2016
WAE         | Replaces KL divergence with Wasserstein distance                    | 2018
Optimus     | Combines VAE with pretrained Transformers for text generation       | 2020
NVAE        | Deep hierarchical VAE achieving very high image generation quality  | 2020

7. VAE vs AE vs GAN Comparison

Property                | AE                                   | VAE                                   | GAN
Objective               | Reconstruct input                    | Maximize ELBO                         | Adversarial game
Latent space            | Irregular, not samplable             | Continuous, regularized, samplable    | No explicit latent space structure
Generation ability      | None (or very weak)                  | Yes, but images tend to be blurry     | Yes, sharp images
Training stability      | Stable                               | Stable                                | Unstable, mode collapse
Likelihood estimation   | None                                 | Yes (ELBO lower bound)                | None
Mathematical framework  | Deterministic mapping                | Variational inference                 | Game theory
Interpolation           | Poor                                 | Good, continuous smooth latent space  | Possible but smoothness not guaranteed
Typical applications    | Dimensionality reduction, denoising  | Generation, representation learning   | High-quality image generation

8. Discussion and Reflections

8.1 Why Are VAE-Generated Images Blurry?

The fundamental reason VAE-generated images are blurry lies in the form of the reconstruction loss:

  1. Limitations of pixel-wise loss: MSE/BCE computes the average pixel-level error. For a region of an image where multiple training samples correspond to different details (e.g., different hair textures), the VAE tends to output the average of these possibilities, resulting in blur
  2. The cost of KL regularization: The KL term forces the latent space to remain well-structured, which inherently compresses information capacity. The "smoothness" of the latent space means that small changes in \(z\) lead to small changes in the output, making it difficult to encode sharp, high-frequency details
  3. Single-sample estimation: During training, typically only one \(z\) is sampled, introducing variance in the gradient estimate of the reconstruction loss, which further biases optimization toward conservative (blurry) solutions

8.2 The Role of VAE in Modern Generative Models

Although VAE has been surpassed by diffusion models and autoregressive models for end-to-end image generation, the ideas and components of VAE continue to play a critical role in modern architectures:

  • Stable Diffusion's latent space: Uses a pretrained KL-VAE (a regularized autoencoder) to compress images into a low-dimensional latent space, then runs the diffusion process in that latent space. This dramatically reduces computational cost — compressing from pixel space (e.g., \(512 \times 512 \times 3\)) to latent space (e.g., \(64 \times 64 \times 4\))
  • DALL-E's discrete encoding: Uses dVAE to encode images into discrete token sequences, then models the joint image-text token sequence with a Transformer
  • Representation learning: VAE's latent space remains widely used in drug discovery (molecular generation), music generation, anomaly detection, and other domains

8.3 From VAE to VQ-VAE to Latent Diffusion

VAE has played a pivotal bridging role in the evolution of modern high-quality generative models:

VAE (2013)
 │  Continuous latent space + variational inference
 │
 ├──→ VQ-VAE (2017)
 │     Discrete latent space + codebook quantization
 │     │
 │     ├──→ VQ-VAE-2 (2019): Hierarchical discrete latent space
 │     │
 │     └──→ DALL-E (2021): dVAE + Transformer
 │
 └──→ KL-VAE / Regularized AE
       │
       └──→ Latent Diffusion / Stable Diffusion (2022)
             Running diffusion in VAE latent space

Key Insight

VAE's most enduring contribution lies not in its generation quality as a standalone generative model, but in the idea of "learning a structured latent space." This idea has been inherited and developed by subsequent models such as VQ-VAE and Latent Diffusion, making it one of the core components of modern generative AI.


References

  1. Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv:1312.6114
  2. Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML 2014
  3. Higgins, I., et al. (2017). beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2017
  4. van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning. NeurIPS 2017
  5. Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022
