
Generative Foundation Models

Overview

Generative foundation models are large-scale pretrained models capable of generating high-quality content (images, video, 3D, audio, etc.) given conditional inputs. Unlike discriminative foundation models, the core objective of generative foundations is to learn the underlying data distribution and sample from it.

Core paradigm: from "understanding the world" to "creating the world."

Generative Foundation Model Taxonomy:

Text → Image: Stable Diffusion, DALL-E, Midjourney
Text → Video: Sora, Runway Gen-3, Kling
Text → 3D:    DreamFusion, Zero-1-to-3
Text → Audio: AudioLM, MusicGen, Bark
Any  → Any:   Unified Generation Models (CoDi, NExT-GPT)

Diffusion as a Generative Foundation

Diffusion models are currently the dominant class of generative foundation models. For detailed principles, see the Diffusion notes.

Core Review

A Diffusion Model is defined by two processes:

Forward process (adding noise):

\[ q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I) \]

Reverse process (denoising):

\[ p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) \]

The training objective simplifies to noise prediction:

\[ \mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right] \]
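
Since every forward step is Gaussian, \(x_t\) can be sampled directly from \(x_0\) as \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon\) with \(\bar{\alpha}_t = \prod_{s=1}^{t}(1 - \beta_s)\), so each training step needs only one network evaluation. A minimal PyTorch sketch of one step of this objective (eps_model is a placeholder for any noise-prediction network):

import torch
import torch.nn.functional as F

def ddpm_training_loss(eps_model, x0, alphas_cumprod):
    """One step of the simplified DDPM objective (sketch).

    eps_model:      any noise predictor, eps_model(x_t, t) -> predicted noise
    x0:             clean samples, shape (B, C, H, W)
    alphas_cumprod: precomputed alpha_bar_t for t = 0..T-1, shape (T,)
    """
    B, T = x0.shape[0], alphas_cumprod.shape[0]

    # Sample a random timestep and Gaussian noise per example.
    t = torch.randint(0, T, (B,), device=x0.device)
    eps = torch.randn_like(x0)

    # Closed-form forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    abar = alphas_cumprod[t].view(B, 1, 1, 1)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

    # L_simple: mean squared error between true and predicted noise.
    return F.mse_loss(eps_model(x_t, t), eps)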

Conditional Generation: Classifier-Free Guidance (CFG)

CFG is a key technique for balancing generation quality and condition adherence:

\[ \hat{\epsilon}_\theta(x_t, c) = (1 + w) \epsilon_\theta(x_t, c) - w \epsilon_\theta(x_t, \varnothing) \]

Here, \(w\) is the guidance scale, \(c\) is the condition (e.g., text), and \(\varnothing\) is the null condition. A larger \(w\) produces outputs more closely aligned with the condition, at the cost of reduced diversity.
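
In practice, the same network is run twice per sampling step, once with the condition and once with the null condition, and the two predictions are extrapolated as above (usually in a single batched forward pass). A minimal sketch, assuming an eps_model(x, t, c) interface:

import torch

@torch.no_grad()
def cfg_noise_prediction(eps_model, x_t, t, cond, null_cond, w):
    """Classifier-free guidance: eps_hat = (1 + w) * eps(x_t, c) - w * eps(x_t, null).

    eps_model(x, t, c) is any conditional noise predictor; cond / null_cond are
    e.g. the text embeddings and the empty-prompt embedding.
    """
    # Run the conditional and unconditional branches as one batched forward pass.
    x_in = torch.cat([x_t, x_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    c_in = torch.cat([cond, null_cond], dim=0)

    eps_cond, eps_uncond = eps_model(x_in, t_in, c_in).chunk(2, dim=0)

    # Extrapolate away from the unconditional prediction.
    return (1.0 + w) * eps_cond - w * eps_uncond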


DiT: Replacing U-Net with Transformer

DiT (Diffusion Transformer), proposed by Peebles & Xie (2023), replaces the traditional U-Net backbone with a Transformer. For details, see the DiT notes.

Key Improvements

Traditional Diffusion:  Noise → U-Net (CNN-based) → Denoised Image
DiT:                    Noise → Transformer (ViT-based) → Denoised Image
  • Noisy images are split into patches, similar to ViT processing
  • Timestep \(t\) and condition \(c\) are injected via AdaLN-Zero
  • Scaling laws apply equally: larger DiT models produce higher-quality generations

The success of DiT demonstrates that Transformers exhibit superior scaling properties for generative tasks as well, laying the technical foundation for subsequent large-scale generative models such as Sora.
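
A simplified re-implementation of the adaLN-Zero mechanism (not the reference DiT code): the conditioning embedding regresses per-block shift, scale, and gate parameters, and the gates are zero-initialized so that each block starts out as an identity mapping.

import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """DiT-style Transformer block with adaLN-Zero conditioning (sketch)."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Regress 6 modulation vectors (shift/scale/gate x2) from the conditioning embedding.
        self.ada_ln = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        # "Zero" initialization: every residual branch is gated off at the start of training.
        nn.init.zeros_(self.ada_ln[-1].weight)
        nn.init.zeros_(self.ada_ln[-1].bias)

    def forward(self, x, c):
        # x: (B, N, dim) patch tokens; c: (B, dim) timestep + condition embedding.
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada_ln(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x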


Text-to-Image

Stable Diffusion (Stability AI)

Built on the Latent Diffusion Model (LDM), Stable Diffusion performs diffusion in a compressed latent space, drastically reducing computational cost.

Stable Diffusion Architecture:

Text → CLIP Text Encoder → Text Embeddings
                                ↓ (Cross-Attention)
Random Noise → U-Net (Latent Space) → Denoised Latent
                                ↓
                   VAE Decoder → Image (512x512 / 1024x1024)

Key components:

  • VAE: Compresses pixel space into latent space (typically \(8 \times\) downsampling)
  • U-Net / DiT: Performs denoising in latent space
  • Text Encoder: Encodes text conditions using CLIP or T5
  • Scheduler: Controls the sampling process (DDPM, DDIM, DPM-Solver, etc.)

Version evolution:

Version   Backbone         Resolution         Text Encoder
SD 1.5    U-Net            512x512            CLIP ViT-L/14
SDXL      U-Net (larger)   1024x1024          CLIP + OpenCLIP
SD 3      DiT (MMDiT)      Multi-resolution   CLIP + T5-XXL
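
All of these components are bundled by the Hugging Face diffusers pipelines; a short usage sketch (model ID, prompt, and sampler settings are illustrative and may need updating):

import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# The pipeline wires together the VAE, U-Net, CLIP text encoder, and scheduler.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap in a faster solver (the scheduler only controls sampling, not the weights).
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    num_inference_steps=25,   # fewer steps are viable with DPM-Solver
    guidance_scale=7.5,       # CFG guidance scale w
).images[0]
image.save("lighthouse.png")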

DALL-E Series (OpenAI)

  • DALL-E (2021): Based on VQ-VAE + Autoregressive Transformer
  • DALL-E 2 (2022): Based on CLIP + Diffusion (unCLIP architecture)
  • DALL-E 3 (2023): Improved prompt following; trained on highly descriptive synthetic captions produced by an image captioner, markedly improving text-image alignment

Midjourney

The most commercially successful text-to-image product, renowned for its artistic style. Specific technical details have not been publicly disclosed.


Text-to-Video

Sora (OpenAI, 2024)

Sora is a landmark model in text-to-video generation, demonstrating the potential of "video as a world simulator."

Speculated core technology:

Sora Architecture (speculated):

Video → VAE (Spatiotemporal Compression) → Spacetime Latent Patches
                              ↓
Text → Text Encoder → Conditioning
                              ↓
              DiT (Spacetime Transformer)
                              ↓
              VAE Decoder → Video Output

Key characteristics:

  • Spacetime patches: Treats video as a sequence of 3D patches, uniformly handling varying resolutions and durations (sketched after this list)
  • DiT backbone: Inherits DiT's scaling properties
  • Long video generation: Capable of generating coherent videos up to 1 minute
  • Physical understanding: Demonstrates a degree of 3D consistency and understanding of physical laws
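
Sora's actual tokenizer is unpublished, but the spacetime-patch idea itself is simple tensor bookkeeping; a sketch with arbitrary patch sizes (in practice the split is applied to the VAE latent rather than raw pixels):

import torch

def video_to_spacetime_patches(video, pt=2, ph=16, pw=16):
    """Split a video tensor into a flat sequence of spacetime patches (sketch).

    video: (C, T, H, W); pt / ph / pw are the temporal and spatial patch sizes.
    Returns (num_patches, pt * ph * pw * C), one token per spacetime patch.
    """
    C, T, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(C, T // pt, pt, H // ph, ph, W // pw, pw)
    # Group the patch-grid axes together, then flatten each patch into a token.
    x = x.permute(1, 3, 5, 2, 4, 6, 0)        # (T', H', W', pt, ph, pw, C)
    return x.reshape(-1, pt * ph * pw * C)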

Other Video Generation Models

Model          Organization   Highlights
Runway Gen-3   Runway         Production-grade video generation with multiple control modes
Kling          Kuaishou       Long video generation with strong physics simulation
Pika           Pika Labs      Consumer-oriented video editing and generation
CogVideo       Zhipu AI       Open-source video generation model

Text-to-3D

Text-to-3D generation is a rapidly evolving field, with the core challenge being the scarcity of 3D data.

Optimization-Based Methods

DreamFusion (Poole et al., 2022):

Core idea: Leverage a pretrained 2D diffusion model to provide gradient signals for optimizing a 3D representation (NeRF).

\[ \nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t, \epsilon} \left[ w(t) (\epsilon_\phi(x_t; y, t) - \epsilon) \frac{\partial x}{\partial \theta} \right] \]

Here, the SDS (Score Distillation Sampling) loss distills knowledge from a 2D diffusion model into a 3D model.
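
A minimal PyTorch sketch of one SDS update, using the common trick of returning a scalar whose autograd gradient matches the hand-derived gradient above; the rendering and diffusion interfaces are placeholders:

import torch

def sds_loss(rendered, eps_phi, text_emb, alphas_cumprod):
    """Score Distillation Sampling as a surrogate loss (sketch).

    rendered: image x(theta) rendered differentiably from the 3D representation,
              shape (B, C, H, W), requires grad w.r.t. theta.
    eps_phi:  frozen 2D diffusion noise predictor, eps_phi(x_t, text_emb, t).
    Returns a scalar whose gradient w.r.t. theta equals the SDS gradient.
    """
    B, T = rendered.shape[0], alphas_cumprod.shape[0]

    t = torch.randint(0, T, (B,), device=rendered.device)
    eps = torch.randn_like(rendered)
    abar = alphas_cumprod[t].view(B, 1, 1, 1)
    x_t = abar.sqrt() * rendered + (1 - abar).sqrt() * eps   # noised render

    with torch.no_grad():                                    # no backprop through the diffusion U-Net
        eps_pred = eps_phi(x_t, text_emb, t)

    w_t = 1.0 - abar                                         # one common choice of w(t)
    grad = w_t * (eps_pred - eps)
    # (rendered * grad.detach()).sum() has gradient `grad` w.r.t. the render, so
    # backprop continues through d(render)/d(theta), matching the formula above.
    return (rendered * grad.detach()).sum()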

Feed-Forward Methods

  • Zero-1-to-3 (2023): Generates novel views of an object from a single input image, conditioned on a relative camera pose
  • LRM (Large Reconstruction Model, 2023): A Transformer that directly predicts 3D representations
  • InstantMesh (2024): Combines multi-view generation with 3D reconstruction

Text-to-Audio

AudioLM (Google, 2022)

Models audio as a sequence of discrete tokens, generating audio using a language model paradigm.

AudioLM Pipeline:

Audio → Neural Codec (e.g., SoundStream) → Discrete Tokens
Tokens → Transformer Language Model → Generated Tokens
Generated Tokens → Codec Decoder → Audio Waveform
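
The pipeline is the familiar tokenize-model-detokenize loop; a sketch with hypothetical codec and language-model interfaces:

def generate_audio(codec, token_lm, prompt_waveform, max_new_tokens=500):
    """Language-model-style audio generation (sketch; interfaces are hypothetical).

    codec:    a neural codec with .encode(wave) -> token ids and .decode(ids) -> wave
              (SoundStream / EnCodec-like).
    token_lm: a decoder-only Transformer with a .generate(ids, max_new_tokens) method.
    """
    # 1. Audio -> discrete tokens via the neural codec.
    prompt_tokens = codec.encode(prompt_waveform)          # (1, seq_len)

    # 2. Autoregressive continuation in token space, exactly as in text LMs.
    generated = token_lm.generate(prompt_tokens, max_new_tokens=max_new_tokens)

    # 3. Tokens -> waveform via the codec decoder.
    return codec.decode(generated)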

MusicGen (Meta, 2023)

A model focused on music generation:

  • Uses EnCodec to encode audio into multi-layer discrete tokens
  • Introduces a "delay pattern" so that the multiple parallel codebook streams can be generated by a single-stage Transformer (sketched after this list)
  • Supports both text descriptions and melody as conditional inputs
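
A sketch of the delay pattern itself (token ids and padding are illustrative): codebook k is shifted right by k steps, so each decoding step emits one token per codebook while the finer codebooks of a frame are still produced after its coarser ones.

import torch

def apply_delay_pattern(codes, pad_id):
    """Offset each codebook stream by its index (MusicGen-style delay pattern, sketch).

    codes: (K, T) token ids, K parallel codebooks per audio frame.
    Returns (K, T + K - 1), with codebook k shifted right by k steps.
    """
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad_id, dtype=codes.dtype, device=codes.device)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

# Example with K=3 codebooks and T=4 frames (ids 1..4 per codebook, "." = pad):
# codebook 0: 1 2 3 4 . .
# codebook 1: . 1 2 3 4 .
# codebook 2: . . 1 2 3 4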

Other Audio Generation Models

Model                Type                    Highlights
Bark (Suno)          Speech generation       Supports multiple languages and non-verbal sounds
Stable Audio         Music / sound effects   Based on Latent Diffusion
VALL-E (Microsoft)   Voice cloning           Clones a voice from just 3 seconds of reference audio

Unified Generation: Any-to-Any Models

The goal of unified generation models is to support arbitrary cross-modal transformations within a single model.

Two Technical Approaches

Approach A: LLM as the brain + external generation models

Approach A:

Input (any modality) → Encoder → LLM (understanding + planning) → Instructions
                                                                    ↓
                                    External Generation Models (Diffusion/Codec) → Output (any modality)

Representative models: NExT-GPT, Visual ChatGPT

Advantages: Reuses existing powerful unimodal generation models.

Disadvantages: End-to-end optimization is difficult; information loss between modules.

Approach B: Unified discrete token system

Approach B:

Input (any modality) → VQ Tokenizer → Discrete Tokens → Transformer → Output Tokens → Detokenizer → Output

Representative models: Chameleon (Meta), Gemini

Advantages: End-to-end training; natural cross-modal interaction.

Disadvantages: Discretization incurs information loss; training is challenging.
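
A toy illustration of Approach B's input format, with made-up token ids and sentinel tokens: every modality is flattened into one sequence that a single Transformer models autoregressively.

# Hypothetical sentinel ids marking the image span inside the sequence.
BOI, EOI = 50001, 50002   # "begin of image" / "end of image"

def build_sequence(text_ids, image_vq_ids):
    """Interleave text tokens and VQ image tokens into one flat training sequence."""
    return text_ids + [BOI] + image_vq_ids + [EOI]

# e.g. a caption tokenized to [312, 9932] and an image quantized to 1024 VQ ids:
# [312, 9932, 50001, <1024 image token ids>, 50002]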

Representative Models

  • CoDi (Microsoft): Achieves any-to-any generation by aligning multimodal latent spaces
  • Chameleon (Meta, 2024): Unifies text and images as discrete tokens, processed by a single Transformer
  • Gemini (Google): Natively multimodal, supporting text, image, audio, and video as both input and output

A Unified Perspective on Generative Foundations

Regardless of whether the target is image, video, 3D, or audio generation, the core can be summarized as:

\[ p_\theta(x | c) = \text{GenerativeModel}(c; \theta) \]

where \(c\) is the condition (text, image, etc.) and \(x\) is the output in the target modality.

Unified Framework for Generative Foundations:

Condition c → Condition Encoder → Condition Features
                                      ↓
Noise / Initial Tokens → Generation Backbone (DiT / Autoregressive) → Denoising / Decoding
                                      ↓
                                 Decoder → Target Modality Output
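
The diagram can be read as a three-function interface; an illustrative Python rendering (names and types are schematic, not any particular library's API):

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class GenerativeFoundation:
    """Schematic view of the shared three-stage structure (illustrative only)."""
    encode_condition: Callable[[Any], Any]  # text / image / audio -> condition features
    backbone: Callable[[Any, Any], Any]     # (noise or initial tokens, condition) -> latent / tokens
    decode: Callable[[Any], Any]            # latent / tokens -> image, video, 3D, or audio

    def generate(self, condition: Any, init: Any) -> Any:
        c = self.encode_condition(condition)
        z = self.backbone(init, c)          # denoising (DiT) or autoregressive decoding
        return self.decode(z)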

Current trends:

  1. Backbone unification: DiT is progressively becoming the standard architecture for image and video generation
  2. Modality expansion: From text-to-image to text-to-video, 3D, and audio
  3. Quality improvement: Continuous gains through scaling, better data, and stronger condition injection
  4. Controllability: Fine-grained control via techniques such as ControlNet and IP-Adapter
  5. Flow Matching: As an alternative to / improvement over Diffusion, offering a more direct training objective (see Flow Matching notes)
