
Generative Foundation Models

Overview

Generative foundation models are large-scale pretrained models capable of generating high-quality content (images, video, 3D, audio, etc.) given conditional inputs. Unlike discriminative foundation models, the core objective of generative foundations is to learn the underlying data distribution and sample from it.

Core paradigm: from "understanding the world" to "creating the world."

Generative Foundation Model Taxonomy:

Text → Image: Stable Diffusion, DALL-E, Midjourney
Text → Video: Sora, Runway Gen-3, Kling
Text → 3D:    DreamFusion, Zero-1-to-3
Text → Audio: AudioLM, MusicGen, Bark
Any  → Any:   Unified Generation Models (CoDi, NExT-GPT)

Diffusion as a Generative Foundation

Diffusion models are currently the dominant class of generative foundation models. For detailed principles, see the Diffusion notes.

Core Review

A Diffusion Model is defined by two processes:

Forward process (adding noise):

\[ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I) \]

Reverse process (denoising):

\[ p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) \]

The training objective simplifies to noise prediction:

\[ \mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right] \]
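
Since every forward step is Gaussian, \(x_t\) can be sampled directly from \(x_0\) as \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon\) with \(\bar{\alpha}_t = \prod_{s=1}^{t}(1 - \beta_s)\), so each training step needs only one network evaluation. A minimal PyTorch sketch of one step of this objective (eps_model is a placeholder for any noise-prediction network):

import torch
import torch.nn.functional as F

def ddpm_training_loss(eps_model, x0, alphas_cumprod):
    """One step of the simplified DDPM objective (sketch).

    eps_model:      any noise predictor, eps_model(x_t, t) -> predicted noise
    x0:             clean samples, shape (B, C, H, W)
    alphas_cumprod: precomputed alpha_bar_t for t = 0..T-1, shape (T,)
    """
    B, T = x0.shape[0], alphas_cumprod.shape[0]

    # Sample a random timestep and Gaussian noise per example.
    t = torch.randint(0, T, (B,), device=x0.device)
    eps = torch.randn_like(x0)

    # Closed-form forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    abar = alphas_cumprod[t].view(B, 1, 1, 1)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

    # L_simple: mean squared error between true and predicted noise.
    return F.mse_loss(eps_model(x_t, t), eps)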

Conditional Generation: Classifier-Free Guidance (CFG)

CFG is a key technique for balancing generation quality and condition adherence:

\[ \hat{\epsilon}_\theta(x_t, c) = (1 + w) \epsilon_\theta(x_t, c) - w \epsilon_\theta(x_t, \varnothing) \]

Here, \(w\) is the guidance scale, \(c\) is the condition (e.g., text), and \(\varnothing\) is the null condition. A larger \(w\) produces outputs more closely aligned with the condition, at the cost of reduced diversity.
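
In practice, the same network is run twice per sampling step, once with the condition and once with the null condition, and the two predictions are extrapolated as above (usually in a single batched forward pass). A minimal sketch, assuming an eps_model(x, t, c) interface:

import torch

@torch.no_grad()
def cfg_noise_prediction(eps_model, x_t, t, cond, null_cond, w):
    """Classifier-free guidance: eps_hat = (1 + w) * eps(x_t, c) - w * eps(x_t, null).

    eps_model(x, t, c) is any conditional noise predictor; cond / null_cond are
    e.g. the text embeddings and the empty-prompt embedding.
    """
    # Run the conditional and unconditional branches as one batched forward pass.
    x_in = torch.cat([x_t, x_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    c_in = torch.cat([cond, null_cond], dim=0)

    eps_cond, eps_uncond = eps_model(x_in, t_in, c_in).chunk(2, dim=0)

    # Extrapolate away from the unconditional prediction.
    return (1.0 + w) * eps_cond - w * eps_uncond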


DiT: Replacing U-Net with Transformer

DiT (Diffusion Transformer), proposed by Peebles & Xie (2023), replaces the traditional U-Net backbone with a Transformer. For details, see the DiT notes.

Key Improvements

Traditional Diffusion:  Noise → U-Net (CNN-based) → Denoised Image
DiT:                    Noise → Transformer (ViT-based) → Denoised Image
  • Noisy images are split into patches, similar to ViT processing
  • Timestep \(t\) and condition \(c\) are injected via AdaLN-Zero
  • Scaling laws apply equally: larger DiT models produce higher-quality generations

The success of DiT demonstrates that Transformers exhibit superior scaling properties for generative tasks as well, laying the technical foundation for subsequent large-scale generative models such as Sora.
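
A simplified re-implementation of the adaLN-Zero mechanism (not the reference DiT code): the conditioning embedding regresses per-block shift, scale, and gate parameters, and the gates are zero-initialized so that each block starts out as an identity mapping.

import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """DiT-style Transformer block with adaLN-Zero conditioning (sketch)."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Regress 6 modulation vectors (shift/scale/gate x2) from the conditioning embedding.
        self.ada_ln = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        # "Zero" initialization: every residual branch is gated off at the start of training.
        nn.init.zeros_(self.ada_ln[-1].weight)
        nn.init.zeros_(self.ada_ln[-1].bias)

    def forward(self, x, c):
        # x: (B, N, dim) patch tokens; c: (B, dim) timestep + condition embedding.
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada_ln(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x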


Text-to-Image

Stable Diffusion (Stability AI)

Built on the Latent Diffusion Model (LDM), Stable Diffusion performs diffusion in a compressed latent space, drastically reducing computational cost.

Stable Diffusion Architecture:

Text → CLIP Text Encoder → Text Embeddings
                                ↓ (Cross-Attention)
Random Noise → U-Net (Latent Space) → Denoised Latent
                                ↓
                   VAE Decoder → Image (512x512 / 1024x1024)

Key components:

  • VAE: Compresses pixel space into latent space (typically \(8 \times\) downsampling)
  • U-Net / DiT: Performs denoising in latent space
  • Text Encoder: Encodes text conditions using CLIP or T5
  • Scheduler: Controls the sampling process (DDPM, DDIM, DPM-Solver, etc.)

Version evolution:

Version   Backbone         Resolution         Text Encoder
SD 1.5    U-Net            512x512            CLIP ViT-L/14
SDXL      U-Net (larger)   1024x1024          CLIP + OpenCLIP
SD 3      DiT (MMDiT)      Multi-resolution   CLIP + T5-XXL
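
All of these components are bundled by the Hugging Face diffusers pipelines; a short usage sketch (model ID, prompt, and sampler settings are illustrative and may need updating):

import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# The pipeline wires together the VAE, U-Net, CLIP text encoder, and scheduler.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap in a faster solver (the scheduler only controls sampling, not the weights).
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    num_inference_steps=25,   # fewer steps are viable with DPM-Solver
    guidance_scale=7.5,       # CFG guidance scale w
).images[0]
image.save("lighthouse.png")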

DALL-E Series (OpenAI)

  • DALL-E (2021): Based on VQ-VAE + Autoregressive Transformer
  • DALL-E 2 (2022): Based on CLIP + Diffusion (unCLIP architecture)
  • DALL-E 3 (2023): Improved prompt following; trained on highly descriptive synthetic captions produced by an image captioner, markedly improving text-image alignment

Midjourney

The most commercially successful text-to-image product, renowned for its artistic style. Specific technical details have not been publicly disclosed.


Text-to-Video

Sora (OpenAI, 2024)

Sora is a landmark model in text-to-video generation, demonstrating the potential of "video as a world simulator."

Speculated core technology:

Sora Architecture (speculated):

Video → VAE (Spatiotemporal Compression) → Spacetime Latent Patches
                              ↓
Text → Text Encoder → Conditioning
                              ↓
              DiT (Spacetime Transformer)
                              ↓
              VAE Decoder → Video Output

Key characteristics:

  • Spacetime patches: Treats video as a sequence of 3D patches, uniformly handling varying resolutions and durations (sketched after this list)
  • DiT backbone: Inherits DiT's scaling properties
  • Long video generation: Capable of generating coherent videos up to 1 minute
  • Physical understanding: Demonstrates a degree of 3D consistency and understanding of physical laws
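
Sora's actual tokenizer is unpublished, but the spacetime-patch idea itself is simple tensor bookkeeping; a sketch with arbitrary patch sizes (in practice the split is applied to the VAE latent rather than raw pixels):

import torch

def video_to_spacetime_patches(video, pt=2, ph=16, pw=16):
    """Split a video tensor into a flat sequence of spacetime patches (sketch).

    video: (C, T, H, W); pt / ph / pw are the temporal and spatial patch sizes.
    Returns (num_patches, pt * ph * pw * C), one token per spacetime patch.
    """
    C, T, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(C, T // pt, pt, H // ph, ph, W // pw, pw)
    # Group the patch-grid axes together, then flatten each patch into a token.
    x = x.permute(1, 3, 5, 2, 4, 6, 0)        # (T', H', W', pt, ph, pw, C)
    return x.reshape(-1, pt * ph * pw * C)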

Other Video Generation Models

Model          Organization   Highlights
Runway Gen-3   Runway         Production-grade video generation with multiple control modes
Kling          Kuaishou       Long video generation with strong physics simulation
Pika           Pika Labs      Consumer-oriented video editing and generation
CogVideo       Zhipu AI       Open-source video generation model

Text-to-3D

Text-to-3D generation is a rapidly evolving field, with the core challenge being the scarcity of 3D data.

Optimization-Based Methods

DreamFusion (Poole et al., 2022):

Core idea: Leverage a pretrained 2D diffusion model to provide gradient signals for optimizing a 3D representation (NeRF).

\[ \nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t, \epsilon} \left[ w(t) (\epsilon_\phi(x_t; y, t) - \epsilon) \frac{\partial x}{\partial \theta} \right] \]

Here, the SDS (Score Distillation Sampling) loss distills knowledge from a 2D diffusion model into a 3D model.
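
A minimal PyTorch sketch of one SDS update, using the common trick of returning a scalar whose autograd gradient matches the hand-derived gradient above; the rendering and diffusion interfaces are placeholders:

import torch

def sds_loss(rendered, eps_phi, text_emb, alphas_cumprod):
    """Score Distillation Sampling as a surrogate loss (sketch).

    rendered: image x(theta) rendered differentiably from the 3D representation,
              shape (B, C, H, W), requires grad w.r.t. theta.
    eps_phi:  frozen 2D diffusion noise predictor, eps_phi(x_t, text_emb, t).
    Returns a scalar whose gradient w.r.t. theta equals the SDS gradient.
    """
    B, T = rendered.shape[0], alphas_cumprod.shape[0]

    t = torch.randint(0, T, (B,), device=rendered.device)
    eps = torch.randn_like(rendered)
    abar = alphas_cumprod[t].view(B, 1, 1, 1)
    x_t = abar.sqrt() * rendered + (1 - abar).sqrt() * eps   # noised render

    with torch.no_grad():                                    # no backprop through the diffusion U-Net
        eps_pred = eps_phi(x_t, text_emb, t)

    w_t = 1.0 - abar                                         # one common choice of w(t)
    grad = w_t * (eps_pred - eps)
    # (rendered * grad.detach()).sum() has gradient `grad` w.r.t. the render, so
    # backprop continues through d(render)/d(theta), matching the formula above.
    return (rendered * grad.detach()).sum()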

Feed-Forward Methods

  • Zero-1-to-3 (2023): Generates novel views of an object from a single input image, conditioned on a relative camera pose
  • LRM (Large Reconstruction Model, 2023): A Transformer that directly predicts 3D representations
  • InstantMesh (2024): Combines multi-view generation with 3D reconstruction

Text-to-Audio

AudioLM (Google, 2022)

Models audio as a sequence of discrete tokens, generating audio using a language model paradigm.

AudioLM Pipeline:

Audio → Neural Codec (e.g., SoundStream) → Discrete Tokens
Tokens → Transformer Language Model → Generated Tokens
Generated Tokens → Codec Decoder → Audio Waveform
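
The pipeline is the familiar tokenize-model-detokenize loop; a sketch with hypothetical codec and language-model interfaces:

def generate_audio(codec, token_lm, prompt_waveform, max_new_tokens=500):
    """Language-model-style audio generation (sketch; interfaces are hypothetical).

    codec:    a neural codec with .encode(wave) -> token ids and .decode(ids) -> wave
              (SoundStream / EnCodec-like).
    token_lm: a decoder-only Transformer with a .generate(ids, max_new_tokens) method.
    """
    # 1. Audio -> discrete tokens via the neural codec.
    prompt_tokens = codec.encode(prompt_waveform)          # (1, seq_len)

    # 2. Autoregressive continuation in token space, exactly as in text LMs.
    generated = token_lm.generate(prompt_tokens, max_new_tokens=max_new_tokens)

    # 3. Tokens -> waveform via the codec decoder.
    return codec.decode(generated)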

MusicGen (Meta, 2023)

A model focused on music generation:

  • Uses EnCodec to encode audio into multi-layer discrete tokens
  • Introduces a "delay pattern" so that the multiple parallel codebook streams can be generated by a single-stage Transformer (sketched after this list)
  • Supports both text descriptions and melody as conditional inputs
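
A sketch of the delay pattern itself (token ids and padding are illustrative): codebook k is shifted right by k steps, so each decoding step emits one token per codebook while the finer codebooks of a frame are still produced after its coarser ones.

import torch

def apply_delay_pattern(codes, pad_id):
    """Offset each codebook stream by its index (MusicGen-style delay pattern, sketch).

    codes: (K, T) token ids, K parallel codebooks per audio frame.
    Returns (K, T + K - 1), with codebook k shifted right by k steps.
    """
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad_id, dtype=codes.dtype, device=codes.device)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

# Example with K=3 codebooks and T=4 frames (ids 1..4 per codebook, "." = pad):
# codebook 0: 1 2 3 4 . .
# codebook 1: . 1 2 3 4 .
# codebook 2: . . 1 2 3 4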

Other Audio Generation Models

Model                Type                    Highlights
Bark (Suno)          Speech generation       Supports multiple languages and non-verbal sounds
Stable Audio         Music / sound effects   Based on Latent Diffusion
VALL-E (Microsoft)   Voice cloning           Clones a voice from just 3 seconds of reference audio

Unified Generation: Any-to-Any Models

The goal of unified generation models is to support arbitrary cross-modal transformations within a single model.

Two Technical Approaches

Approach A: LLM as the brain + external generation models

Approach A:

Input (any modality) → Encoder → LLM (understanding + planning) → Instructions
                                                                    ↓
                                    External Generation Models (Diffusion/Codec) → Output (any modality)

Representative models: NExT-GPT, Visual ChatGPT

Advantages: Reuses existing powerful unimodal generation models.

Disadvantages: End-to-end optimization is difficult; information loss between modules.

Approach B: Unified discrete token system

Approach B:

Input (any modality) → VQ Tokenizer → Discrete Tokens → Transformer → Output Tokens → Detokenizer → Output

Representative models: Chameleon (Meta), Gemini

Advantages: End-to-end training; natural cross-modal interaction.

Disadvantages: Discretization incurs information loss; training is challenging.
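
A toy illustration of Approach B's input format, with made-up token ids and sentinel tokens: every modality is flattened into one sequence that a single Transformer models autoregressively.

# Hypothetical sentinel ids marking the image span inside the sequence.
BOI, EOI = 50001, 50002   # "begin of image" / "end of image"

def build_sequence(text_ids, image_vq_ids):
    """Interleave text tokens and VQ image tokens into one flat training sequence."""
    return text_ids + [BOI] + image_vq_ids + [EOI]

# e.g. a caption tokenized to [312, 9932] and an image quantized to 1024 VQ ids:
# [312, 9932, 50001, <1024 image token ids>, 50002]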

Representative Models

  • CoDi (Microsoft): Achieves any-to-any generation by aligning multimodal latent spaces
  • Chameleon (Meta, 2024): Unifies text and images as discrete tokens, processed by a single Transformer
  • Gemini (Google): Natively multimodal, supporting text, image, audio, and video as both input and output

A Unified Perspective on Generative Foundations

Regardless of whether the target is image, video, 3D, or audio generation, the core can be summarized as:

\[ p_\theta(x | c) = \text{GenerativeModel}(c; \theta) \]

where \(c\) is the condition (text, image, etc.) and \(x\) is the output in the target modality.

Unified Framework for Generative Foundations:

Condition c → Condition Encoder → Condition Features
                                      ↓
Noise / Initial Tokens → Generation Backbone (DiT / Autoregressive) → Denoising / Decoding
                                      ↓
                                 Decoder → Target Modality Output
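
The diagram can be read as a three-function interface; an illustrative Python rendering (names and types are schematic, not any particular library's API):

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class GenerativeFoundation:
    """Schematic view of the shared three-stage structure (illustrative only)."""
    encode_condition: Callable[[Any], Any]  # text / image / audio -> condition features
    backbone: Callable[[Any, Any], Any]     # (noise or initial tokens, condition) -> latent / tokens
    decode: Callable[[Any], Any]            # latent / tokens -> image, video, 3D, or audio

    def generate(self, condition: Any, init: Any) -> Any:
        c = self.encode_condition(condition)
        z = self.backbone(init, c)          # denoising (DiT) or autoregressive decoding
        return self.decode(z)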

Current trends:

  1. Backbone unification: DiT is progressively becoming the standard architecture for image and video generation
  2. Modality expansion: From text-to-image to text-to-video, 3D, and audio
  3. Quality improvement: Continuous gains through scaling, better data, and stronger condition injection
  4. Controllability: Fine-grained control via techniques such as ControlNet and IP-Adapter
  5. Flow Matching: As an alternative to / improvement over Diffusion, offering a more direct training objective (see Flow Matching notes)
