Generative Foundation Models
Overview
Generative foundation models are large-scale pretrained models capable of generating high-quality content (images, video, 3D, audio, etc.) given conditional inputs. Unlike discriminative foundation models, the core objective of generative foundations is to learn the underlying data distribution and sample from it.
Core paradigm: from "understanding the world" to "creating the world."
Generative Foundation Model Taxonomy:
Text → Image: Stable Diffusion, DALL-E, Midjourney
Text → Video: Sora, Runway Gen-3, Kling
Text → 3D: DreamFusion, Zero-1-to-3
Text → Audio: AudioLM, MusicGen, Bark
Any → Any: Unified Generation Models (CoDi, NExT-GPT)
Diffusion as a Generative Foundation
Diffusion models are currently the dominant generative foundation. For detailed principles, see the Diffusion notes.
Core Review
A Diffusion Model is defined by two processes:
Forward process (adding noise):
\[
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)
\]
Reverse process (denoising):
\[
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
\]
The training objective simplifies to noise prediction:
\[
\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\varepsilon}\!\left[\left\|\varepsilon - \varepsilon_\theta(x_t, t)\right\|^2\right]
\]
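To make this concrete, here is a minimal PyTorch sketch of one training step; the \(\varepsilon\)-predictor `model` and the precomputed cumulative schedule `alpha_bar` are assumed inputs, not any particular implementation:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alpha_bar, num_timesteps=1000):
    """Simplified DDPM objective: predict the noise that was added to x0."""
    b = x0.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=x0.device)  # random timesteps
    eps = torch.randn_like(x0)                                   # Gaussian noise
    a = alpha_bar[t].view(b, 1, 1, 1)                            # cumprod of (1 - beta_t)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps                   # closed-form forward process
    return F.mse_loss(model(x_t, t), eps)                        # ||eps - eps_theta(x_t, t)||^2
```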
Conditional Generation: Classifier-Free Guidance (CFG)
CFG is a key technique for balancing generation quality and condition adherence:
\[
\tilde{\varepsilon}_\theta(x_t, c) = \varepsilon_\theta(x_t, \varnothing) + w\left(\varepsilon_\theta(x_t, c) - \varepsilon_\theta(x_t, \varnothing)\right)
\]
Here, \(w\) is the guidance scale, \(c\) is the condition (e.g., text), and \(\varnothing\) is the null condition. A larger \(w\) produces outputs more closely aligned with the condition, at the cost of reduced diversity.
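At sampling time, the guided prediction is just a linear combination of two forward passes through the same network; a minimal sketch, assuming a conditional \(\varepsilon\)-predictor with the signature shown:

```python
import torch

@torch.no_grad()
def cfg_epsilon(model, x_t, t, cond, null_cond, w=7.5):
    """Classifier-free guidance: blend conditional and unconditional predictions."""
    eps_cond = model(x_t, t, cond)         # eps_theta(x_t, c)
    eps_uncond = model(x_t, t, null_cond)  # eps_theta(x_t, null)
    return eps_uncond + w * (eps_cond - eps_uncond)
```

In practice, the null condition is the embedding of an empty prompt, and conditions are randomly dropped during training so a single model learns both predictions.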
DiT: Replacing U-Net with Transformer
DiT (Diffusion Transformer), proposed by Peebles & Xie (2023), replaces the traditional U-Net backbone with a Transformer. For details, see the DiT notes.
Key Improvements
Traditional Diffusion: Noise → U-Net (CNN-based) → Denoised Image
DiT: Noise → Transformer (ViT-based) → Denoised Image
- Noisy images are split into patches, similar to ViT processing
- Timestep \(t\) and condition \(c\) are injected via AdaLN-Zero (sketched after this list)
- Scaling laws apply equally: larger DiT models produce higher-quality generations
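A minimal PyTorch sketch of the AdaLN-Zero mechanism; the class and the generic `sublayer` argument (attention or MLP) are illustrative, not DiT's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaLNZero(nn.Module):
    """Condition injection via adaptive LayerNorm with zero-initialized gating."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Linear(dim, 3 * dim)  # produces shift, scale, gate
        nn.init.zeros_(self.mlp.weight)     # "-Zero": each block starts as the identity
        nn.init.zeros_(self.mlp.bias)

    def forward(self, x, cond_emb, sublayer):
        # x: (B, N, dim) patch tokens; cond_emb: (B, dim) timestep + condition embedding
        shift, scale, gate = self.mlp(F.silu(cond_emb)).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * sublayer(h)
```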
The success of DiT demonstrates that Transformers exhibit superior scaling properties for generative tasks as well, laying the technical foundation for subsequent large-scale generative models such as Sora.
Text-to-Image
Stable Diffusion (Stability AI)
Built on the Latent Diffusion Model (LDM), Stable Diffusion performs diffusion in a compressed latent space, drastically reducing computational cost.
Stable Diffusion Architecture:
Text → CLIP Text Encoder → Text Embeddings
↓ (Cross-Attention)
Random Noise → U-Net (Latent Space) → Denoised Latent
↓
VAE Decoder → Image (512x512 / 1024x1024)
Key components:
- VAE: Compresses pixel space into latent space (typically \(8 \times\) downsampling)
- U-Net / DiT: Performs denoising in latent space
- Text Encoder: Encodes text conditions using CLIP or T5
- Scheduler: Controls the sampling process (DDPM, DDIM, DPM-Solver, etc.)
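These components ship bundled in libraries such as Hugging Face diffusers; a minimal usage sketch (model ID and settings are illustrative):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the full pipeline: VAE + U-Net + CLIP text encoder + scheduler.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    num_inference_steps=30,  # scheduler steps
    guidance_scale=7.5,      # CFG scale w
).images[0]
image.save("lighthouse.png")
```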
Version evolution:
| Version | Backbone | Resolution | Text Encoder |
|---|---|---|---|
| SD 1.5 | U-Net | 512x512 | CLIP ViT-L/14 |
| SDXL | U-Net (larger) | 1024x1024 | CLIP + OpenCLIP |
| SD 3 | DiT (MMDiT) | Multi-resolution | CLIP + T5-XXL |
DALL-E Series (OpenAI)
- DALL-E (2021): Based on a discrete VAE (dVAE) + Autoregressive Transformer
- DALL-E 2 (2022): Based on CLIP + Diffusion (unCLIP architecture)
- DALL-E 3 (2023): Improved text understanding; trained on detailed synthetic captions produced by a dedicated image captioner for better text-image alignment
Midjourney
Among the most commercially successful text-to-image products, renowned for its artistic style. Specific technical details have not been publicly disclosed.
Text-to-Video
Sora (OpenAI, 2024)
Sora is a landmark model in text-to-video generation, demonstrating the potential of "video as a world simulator."
Speculated core technology:
Sora Architecture (speculated):
Video → VAE (Spatiotemporal Compression) → Spacetime Latent Patches
↓
Text → Text Encoder → Conditioning
↓
DiT (Spacetime Transformer)
↓
VAE Decoder → Video Output
Key characteristics:
- Spacetime patches: Treats video as a sequence of 3D patches, uniformly handling varying resolutions and durations (see the sketch after this list)
- DiT backbone: Inherits DiT's scaling properties
- Long video generation: Capable of generating coherent videos up to one minute long
- Physical understanding: Demonstrates a degree of 3D consistency and understanding of physical laws
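Sora's implementation is not public, but the spacetime-patch idea is simple to express; a plausible patchify sketch in PyTorch, with patch sizes and latent layout as assumptions:

```python
import torch

def spacetime_patches(video, pt=2, ph=16, pw=16):
    """Split a (latent) video into flattened spacetime patches.

    video: (B, C, T, H, W) with T, H, W divisible by the patch sizes.
    Returns (B, N, C * pt * ph * pw): clips of different durations or
    resolutions simply yield different sequence lengths N.
    """
    B, C, T, H, W = video.shape
    x = video.reshape(B, C, T // pt, pt, H // ph, ph, W // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)  # (B, nT, nH, nW, C, pt, ph, pw)
    return x.flatten(1, 3).flatten(2)      # (B, N, patch_dim)
```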
Other Video Generation Models
| Model | Organization | Highlights |
|---|---|---|
| Runway Gen-3 | Runway | Production-grade video generation with multiple control modes |
| Kling | Kuaishou | Long video generation with strong physics simulation |
| Pika | Pika Labs | Consumer-oriented video editing and generation |
| CogVideo | Zhipu AI | Open-source video generation model |
Text-to-3D
Text-to-3D generation is a rapidly evolving field, with the core challenge being the scarcity of 3D data.
Optimization-Based Methods
DreamFusion (Poole et al., 2022):
Core idea: Leverage a pretrained 2D diffusion model to provide gradient signals for optimizing a 3D representation (NeRF).
\[
\nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t,\,\varepsilon}\!\left[w(t)\left(\varepsilon_\phi(x_t;\, y, t) - \varepsilon\right)\frac{\partial x}{\partial \theta}\right]
\]
Here, the SDS (Score Distillation Sampling) loss distills knowledge from a 2D diffusion model into a 3D model: each rendered view \(x\) is noised, the frozen 2D model \(\varepsilon_\phi\) predicts that noise given the text prompt \(y\), and the residual is backpropagated to the 3D parameters \(\theta\), omitting the U-Net Jacobian.
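A hedged sketch of one SDS update in PyTorch; `render_fn` (a differentiable renderer over the 3D parameters) and `eps_model` (the frozen text-conditioned 2D model) are placeholders, and the \(w(t)\) weighting is omitted for brevity:

```python
import torch

def sds_step(render_fn, eps_model, text_emb, alpha_bar, num_timesteps=1000):
    """One Score Distillation Sampling step: nudge 3D params via a 2D prior."""
    x = render_fn()                            # (B, C, H, W) rendered view
    t = torch.randint(0, num_timesteps, (x.shape[0],), device=x.device)
    eps = torch.randn_like(x)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x + (1 - a).sqrt() * eps  # noise the rendering
    with torch.no_grad():                      # skip the U-Net Jacobian
        eps_pred = eps_model(x_t, t, text_emb)
    x.backward(gradient=eps_pred - eps)        # inject the SDS gradient into theta
```

An optimizer step on the 3D parameters (reached through `render_fn`) follows each call.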
Feed-Forward Methods
- Zero-1-to-3 (2023): Generates multi-view images from a single input image
- LRM (Large Reconstruction Model, 2023): A Transformer that directly predicts 3D representations
- InstantMesh (2024): Combines multi-view generation with 3D reconstruction
Text-to-Audio
AudioLM (Google, 2022)
Models audio as a sequence of discrete tokens, generating audio using a language model paradigm.
AudioLM Pipeline:
Audio → Neural Codec (e.g., SoundStream) → Discrete Tokens
Tokens → Transformer Language Model → Generated Tokens
Generated Tokens → Codec Decoder → Audio Waveform
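A hedged sketch of that loop; `codec` and `lm` are placeholder objects, and AudioLM's separate semantic and acoustic token stages are collapsed into a single stream for clarity:

```python
import torch

@torch.no_grad()
def generate_audio(codec, lm, prompt_wave, steps=500):
    """Autoregressive audio generation over discrete codec tokens."""
    tokens = codec.encode(prompt_wave)                  # (B, N) discrete token ids
    for _ in range(steps):
        logits = lm(tokens)[:, -1, :]                   # next-token distribution
        nxt = torch.multinomial(logits.softmax(-1), 1)  # sample one token
        tokens = torch.cat([tokens, nxt], dim=1)
    return codec.decode(tokens)                         # back to a waveform
```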
MusicGen (Meta, 2023)
A model focused on music generation:
- Uses EnCodec to encode audio into multi-layer discrete tokens
- Introduces a "delay pattern" to solve parallel generation across multiple codebooks
- Supports both text descriptions and melody as conditional inputs
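The delay pattern itself is easy to state in code; a sketch, assuming \(K\) codebooks over \(T\) frames and a padding token:

```python
import torch

def apply_delay_pattern(codes, pad_id=0):
    """MusicGen-style delay pattern: shift codebook k right by k steps.

    codes: (K, T) token ids from K codebooks. Returns (K, T + K - 1), so the
    model can emit one token per codebook in parallel at each step while
    codebook k still sees codebooks 0..k-1 from earlier positions.
    """
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad_id, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]  # codebook k delayed by k positions
    return out
```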
Other Audio Generation Models
| Model | Type | Highlights |
|---|---|---|
| Bark (Suno) | Speech generation | Supports multiple languages and non-verbal sounds |
| Stable Audio | Music/Sound effects | Based on Latent Diffusion |
| VALL-E (Microsoft) | Voice cloning | Clones a voice from just 3 seconds of reference audio |
Unified Generation: Any-to-Any Models
The goal of unified generation models is to support arbitrary cross-modal transformations within a single model.
Two Technical Approaches
Approach A: LLM as the brain + external generation models
Approach A:
Input (any modality) → Encoder → LLM (understanding + planning) → Instructions
↓
External Generation Models (Diffusion/Codec) → Output (any modality)
Representative models: NExT-GPT, Visual ChatGPT
Advantages: Reuses existing powerful unimodal generation models.
Disadvantages: End-to-end optimization is difficult; information loss between modules.
Approach B: Unified discrete token system
Approach B:
Input (any modality) → VQ Tokenizer → Discrete Tokens → Transformer → Output Tokens → Detokenizer → Output
Representative models: Chameleon (Meta), Gemini
Advantages: End-to-end training; natural cross-modal interaction.
Disadvantages: Discretization incurs information loss; training is challenging.
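A toy illustration of what a unified token sequence might look like; the vocabulary split and special tokens below are assumptions for the sketch, not Chameleon's or Gemini's actual scheme:

```python
# Hypothetical shared vocabulary: text subwords first, then VQ image codes.
TEXT_VOCAB = 32_000        # subword token ids [0, 32000)
IMAGE_VOCAB = 8_192        # VQ codebook entries, offset after text ids
BOI, EOI = 40_192, 40_193  # begin/end-of-image marker tokens

def image_token(vq_id: int) -> int:
    """Map a VQ codebook index into the shared vocabulary."""
    return TEXT_VOCAB + vq_id

# One training sequence interleaving a caption with its image tokens:
# [text ids...] [BOI] [image ids...] [EOI] -- a single Transformer models it all.
sequence = [17, 942, 5, BOI] + [image_token(i) for i in (301, 77, 4090)] + [EOI]
```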
Representative Models
- CoDi (Microsoft): Achieves any-to-any generation by aligning multimodal latent spaces
- Chameleon (Meta, 2024): Unifies text and images as discrete tokens, processed by a single Transformer
- Gemini (Google): Natively multimodal, supporting text, image, audio, and video as both input and output
A Unified Perspective on Generative Foundations
Regardless of whether the target is image, video, 3D, or audio generation, the core can be summarized as learning and sampling a conditional distribution:
\[
x \sim p_\theta(x \mid c)
\]
where \(c\) is the condition (text, image, etc.) and \(x\) is the output in the target modality.
Unified Framework for Generative Foundations:
Condition c → Condition Encoder → Condition Features
↓
Noise / Initial Tokens → Generation Backbone (DiT / Autoregressive) → Denoising / Decoding
↓
Decoder → Target Modality Output
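To tie the framework together, a generic conditional sampling loop under this abstraction might look as follows; every component here is a placeholder interface, not a real library API:

```python
import torch

@torch.no_grad()
def generate(cond_encoder, backbone, decoder, scheduler, condition, latent_shape, w=5.0):
    """Generic conditional generation: encode condition, denoise, decode."""
    c = cond_encoder(condition)            # condition features
    null_c = torch.zeros_like(c)           # null condition for CFG
    x = torch.randn(latent_shape)          # start from pure noise
    for t in scheduler.timesteps:          # denoising loop
        eps_c = backbone(x, t, c)
        eps_u = backbone(x, t, null_c)
        eps = eps_u + w * (eps_c - eps_u)  # classifier-free guidance
        x = scheduler.step(eps, t, x)      # one reverse-process update
    return decoder(x)                      # map latent/tokens to target modality
```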
Current trends:
- Backbone unification: DiT is progressively becoming the standard architecture for image and video generation
- Modality expansion: From text-to-image to text-to-video, 3D, and audio
- Quality improvement: Continuous gains through scaling, better data, and stronger condition injection
- Controllability: Fine-grained control via techniques such as ControlNet and IP-Adapter
- Flow Matching: As an alternative to / improvement over Diffusion, offering a more direct training objective (see Flow Matching notes)