
Generative 3D: Diffusion-Driven 3D Reconstruction and Generation

Generative 3D refers to a class of methods that leverage 2D generative model priors (primarily diffusion models) to generate or reconstruct 3D representations. The motivation comes from a stark observation: 3D data on the internet is extremely scarce (even Objaverse-XL has only ~10M assets), while 2D image data is essentially unlimited. If we can "distill" the visual priors that a 2D diffusion model has already learned into 3D representations — or train a diffusion model that produces multi-view consistent images — we can sidestep the 3D data scarcity bottleneck.

Starting from DreamFusion in September 2022, this line of work has fragmented into five major schools within three years: from per-scene SDS / VSD optimization, to end-to-end multi-view diffusion, to feed-forward reconstruction that produces 3D in seconds (the LRM family), converging in 2024 toward structured-latent models (Trellis, Hunyuan3D 2.0). This article systematically traces these five paths — their methodology, key formulas, mutual relationships, and boundaries.


Overview

Task Definition

Generative 3D solves a problem that can be uniformly stated as:

\[ \hat{\theta} = \arg\max_{\theta} \, p(\theta \mid c) \]

where \(\theta\) is the parameter of some 3D representation (NeRF weights, a set of 3D Gaussian ellipsoids, mesh vertices + texture, Triplane features, etc.), and \(c\) is the conditioning input (a text prompt, a single image, or a few sparse views).

The key challenge: \(p(\theta \mid c)\) has no large-scale dataset for direct supervision. The core innovation of generative 3D is answering "how to indirectly approximate this distribution using 2D diffusion priors."

Boundaries with Adjacent Directions

| | Generative 3D | Geometric 3D Reconstruction | World Models |
|---|---|---|---|
| Input | Single image / text / few views | Dense multi-view (dozens of images) | Current state + action |
| Output | Static 3D asset | Static 3D scene | Temporal state prediction |
| Core mechanism | 2D diffusion priors | Geometry + volume rendering | Temporal modeling |
| Representative methods | DreamFusion / LRM / Trellis | NeRF / 3DGS / SfM-MVS | Dreamer / Genie / Cosmos |
| Time dimension | Static | Static | Dynamic |

See § 9 Boundaries with Adjacent Topics at the end of this article for details.

Timeline Panorama

graph LR
    A[2022.09<br/>DreamFusion<br/>SDS] --> B[2023.03<br/>Zero-1-to-3<br/>view-conditioned diffusion]
    B --> C[2023.05<br/>ProlificDreamer<br/>VSD]
    C --> D[2023.11<br/>LRM<br/>feed-forward reconstruction]
    D --> E[2023.12<br/>ReconFusion<br/>sparse-view + diffusion]
    E --> F[2024.03<br/>InstantMesh<br/>LRM + Mesh]
    F --> G[2024.05<br/>SV3D / CAT3D<br/>video diffusion as 3D prior]
    G --> H[2024.12<br/>Trellis<br/>structured latent]
    H --> I[2025.01<br/>Hunyuan3D 2.0<br/>shape + texture dual diffusion]

The Five Schools

graph TD
    Root[Generative 3D] --> P1[1. SDS Optimization<br/>per-scene]
    Root --> P2[2. VSD Optimization<br/>particle distribution]
    Root --> P3[3. Multi-View Diffusion<br/>view-consistent prior]
    Root --> P4[4. Feed-Forward Reconstruction<br/>feed-forward]
    Root --> P5[5. Sparse-View Reconstruction<br/>sparse-view]

    P1 --> P1a[DreamFusion / SJC<br/>Magic3D / Latent-NeRF<br/>Fantasia3D]
    P2 --> P2a[ProlificDreamer<br/>DreamGaussian<br/>GaussianDreamer]
    P3 --> P3a[Zero-1-to-3 / MVDream<br/>SyncDreamer / Wonder3D<br/>SV3D / Zero123++]
    P4 --> P4a[LRM / InstantMesh<br/>TripoSR / LGM / GRM<br/>Trellis / Hunyuan3D]
    P5 --> P5a[ReconFusion / CAT3D<br/>ZeroNVS<br/>pixelSplat / MVSplat]

Reading Guide

The article unfolds in a "methodology first, representative works second" order. If you only want a quick overview of the SOTA, jump directly to § 5.3 Structured-Latent School and § 8 Method Panorama. For a deep understanding of the SDS / VSD mathematics, § 2 and § 3 give complete derivations.

Prerequisites

Reading this article requires familiarity with:

  • Diffusion model basics: DDPM / DDIM, the score-based view, classifier-free guidance (see Diffusion.md)
  • 3D representations and differentiable rendering: NeRF and 3D Gaussian Splatting (see 3D视觉.md)


1. Choice of 3D Representation

The first design decision in generative 3D is: what to use as the 3D representation. This choice directly affects the optimization scheme, training cost, inference speed, and editability.

1.1 Five Mainstream 3D Representations

| Representation | Data structure | Rendering | Differentiable | Memory | Diffusion-friendly |
|---|---|---|---|---|---|
| NeRF | MLP implicit field \(f_\theta(\mathbf{x}, \mathbf{d}) \to (c, \sigma)\) | Volume rendering (ray marching) | ✅ | Small (network weights) | ❌ (continuous function hard to discretize) |
| 3D Gaussian Splatting | Explicit Gaussian ellipsoid set \(\{\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i, \alpha_i, \mathbf{c}_i\}\) | Differentiable rasterization | ✅ | Medium (\(10^5\)-\(10^6\) Gaussians) | ⚠️ (unordered set hard to diffuse, needs structuring) |
| Mesh | Vertices + faces + texture | Rasterization | ✅ (via DMTet / FlexiCubes) | Small | ❌ (topology not differentiable) |
| Triplane | Three orthogonal 2D feature maps + decoder MLP | Volume rendering | ✅ | Medium | ✅ (2D feature maps natural for diffusion) |
| Voxel | Regular 3D grid | Volume rendering | ✅ | Large (\(O(N^3)\)) | ✅ (3D conv handles naturally) |

1.2 Different Preferences: Optimization vs. Feed-Forward

Optimization school (per-scene) prefers representations that are fast-rendering, differentiable, and parameter-dense:

  • Early: NeRF (DreamFusion / SJC / Magic3D)
  • Modern: 3D Gaussian Splatting (DreamGaussian / GaussianDreamer)

3DGS reduced SDS optimization time from DreamFusion's 1.5 hours to 2-5 minutes, the biggest engineering speed-up in this line across 2023-2024.

Feed-forward school prefers representations that transformers can predict directly:

  • LRM family: Triplane (three 2D feature maps, perfectly suited for transformers)
  • LGM / GRM: 3DGS (each Gaussian as a token)
  • Trellis: sparse structured latent (placing tokens only near the object surface)
  • Hunyuan3D 2.0: Signed Distance Field + diffusion

1.3 The Rise of Structured Latents

The key trend of 2024: no longer diffusing directly on a 3D representation, but diffusing on an intermediate latent and then decoding it into specific 3D representations. This mirrors the latent-diffusion recipe behind Stable Diffusion in 2D.

Representative works:

  • Trellis: Sparse Structured Latent (SLAT), placing latent tokens only on voxels with occupancy above a threshold
  • Hunyuan3D 2.0: First diffusing the SDF, then separately diffusing the texture
  • 3DTopia-XL: Triplane latent + diffusion

2. SDS Optimization School

2.1 DreamFusion and the Birth of SDS

Score Distillation Sampling (SDS), proposed by DreamFusion (Poole et al., 2022), is the foundational work of this line. The problem it addresses:

We have a pretrained 2D text-to-image diffusion model \(\epsilon_\phi\). Given a text prompt \(y\) (e.g., "a hamburger"), how do we optimize a 3D representation \(\theta\) (NeRF parameters) such that 2D images \(\mathbf{x} = g(\theta, \pi)\) rendered from any viewpoint \(\pi\) "look like" what \(\epsilon_\phi\) would generate from \(y\)?

The naive idea: at each viewpoint \(\pi\), sample a \(\hat{\mathbf{x}} \sim \epsilon_\phi(\cdot \mid y)\), then make the rendered image \(g(\theta, \pi)\) match \(\hat{\mathbf{x}}\). Problem: the \(\hat{\mathbf{x}}\) sampled at different viewpoints are inconsistent with each other, and the 3D representation gets pulled into a multi-faced monster.

SDS's key insight: we don't actually need to sample images from \(\epsilon_\phi\); we only need the diffusion model's "scoring gradient" on the current rendered image.

2.2 SDS Derivation

Let \(q_\phi(\mathbf{x} \mid y)\) denote the 2D image distribution implicitly defined by the diffusion model. We want to minimize:

\[ \mathcal{L}_{\text{Diff}}(\theta, y) = \mathbb{E}_{t, \pi, \epsilon} \left[ \omega(t) \, \| \epsilon_\phi(\mathbf{x}_t; y, t) - \epsilon \|^2 \right] \]

where \(\mathbf{x} = g(\theta, \pi)\) and \(\mathbf{x}_t = \alpha_t \mathbf{x} + \sigma_t \epsilon\).

Taking the gradient w.r.t. \(\theta\) (chain rule):

\[ \nabla_\theta \mathcal{L}_{\text{Diff}} = \mathbb{E}\left[\omega(t) \big(\epsilon_\phi(\mathbf{x}_t; y, t) - \epsilon\big) \cdot \underbrace{\frac{\partial \epsilon_\phi}{\partial \mathbf{x}_t}}_{\text{U-Net Jacobian}} \cdot \frac{\partial \mathbf{x}_t}{\partial \mathbf{x}} \cdot \frac{\partial \mathbf{x}}{\partial \theta}\right] \]

Problem: the U-Net Jacobian \(\partial \epsilon_\phi / \partial \mathbf{x}_t\) is extremely expensive and numerically unstable to compute.

SDS's approximation: simply drop this term, yielding:

\[ \boxed{ \nabla_\theta \mathcal{L}_{\text{SDS}} \;=\; \mathbb{E}_{t, \pi, \epsilon} \left[ \omega(t) \big(\epsilon_\phi(\mathbf{x}_t; y, t) - \epsilon\big) \frac{\partial \mathbf{x}}{\partial \theta} \right] } \]

Physical interpretation: regard \(\epsilon_\phi(\mathbf{x}_t; y, t) - \epsilon\) as "the direction in which the current rendered image deviates from the prompt description." This direction is back-propagated through the renderer \(\partial \mathbf{x} / \partial \theta\) to the 3D parameters \(\theta\).
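A minimal PyTorch-style sketch of one SDS update, assuming a differentiable renderer `render(camera)` that closes over the 3D parameters \(\theta\), a frozen noise predictor `eps_phi(x_t, t, y)`, and precomputed schedule tensors `alphas, sigmas, w` (all of these handles are placeholder names, not any particular library's API):

```python
import torch

def sds_step(render, camera, prompt_emb, eps_phi, alphas, sigmas, w, optimizer):
    """One SDS update (sketch). `optimizer` holds the 3D parameters theta that `render` closes over."""
    x = render(camera)                                 # x = g(theta, pi), differentiable w.r.t. theta
    t = torch.randint(20, 980, (1,))                   # random diffusion timestep
    eps = torch.randn_like(x)
    x_t = alphas[t] * x + sigmas[t] * eps              # noise the rendered image

    with torch.no_grad():                              # skip the U-Net Jacobian: the SDS approximation
        residual = eps_phi(x_t, t, prompt_emb) - eps   # "direction the render deviates from the prompt"

    loss = (w[t] * residual * x).sum()                 # surrogate loss: d loss / d theta = w (eps_hat - eps) dx/dtheta
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The surrogate loss `(grad.detach() * x).sum()` is the standard trick for injecting a precomputed gradient through only the renderer, exactly matching the boxed SDS formula above.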

2.3 The Essence of SDS: Score Distillation

A deeper understanding: SDS is equivalent to score-function matching. Recall the score-based view (Diffusion.md § Score-Based View):

\[ \epsilon_\phi(\mathbf{x}_t; y, t) \approx -\sigma_t \nabla_{\mathbf{x}_t} \log q_\phi(\mathbf{x}_t \mid y) \]

Substituting into the SDS gradient:

\[ \nabla_\theta \mathcal{L}_{\text{SDS}} \propto -\mathbb{E}_t\left[\omega(t) \sigma_t \nabla_{\mathbf{x}_t} \log q_\phi(\mathbf{x}_t \mid y) \cdot \frac{\partial \mathbf{x}}{\partial \theta}\right] \]

This is precisely "pulling" the score of \(q_\phi(\mathbf{x} \mid y)\) through the rendering pipeline into the 3D parameter space.

2.4 Improvements

Score Jacobian Chaining (SJC, Wang et al. 2022)

Proposed nearly concurrently with SDS, SJC derives an almost identical update rule from the score-based perspective, and introduces Perturb-and-Average Scoring (PAAS) to estimate scores on rendered images that are out-of-distribution for a denoiser trained only on noisy natural images.

Latent-NeRF (Metzer et al. 2022)

Renders NeRF into the Stable Diffusion latent space instead of RGB space, avoiding the VAE encode overhead at every SDS step. The cost: rendering quality is bounded by the SD VAE decoder.

Magic3D (Lin et al. 2023)

Two-stage coarse-to-fine:

  1. Stage 1: low-resolution NeRF + SDS, optimizing geometry rapidly with 64×64 rendering (about 25 minutes)
  2. Stage 2: extract a mesh from NeRF (DMTet), switch to high-resolution (512×512) + SDS for texture refinement (about 15 minutes)

Total time ≈ 40 minutes, with quality significantly better than DreamFusion. It is the engineering exemplar of the SDS school.

Fantasia3D (Chen et al. 2023)

Decouples geometry from appearance: first uses DMTet + SDS to optimize the geometric shape (rendering only normal maps), then fixes geometry and uses PBR materials + SDS to optimize appearance. This separation yields cleaner geometry and more realistic materials.

2.5 Inherent Problems of SDS

The Janus Problem (Multi-Faced)

The most famous failure mode: a 3D object viewed from any angle has a "front face" (e.g., a multi-faced lion).

Causes:

  1. The training data of 2D diffusion models is dominated by frontal views; the prior \(p(\mathbf{x} \mid y)\) assigns significantly higher probability to frontal viewpoints than to side or back views
  2. SDS's mode-seeking nature: at each viewpoint, it greedily approaches the highest-probability mode → every viewpoint gets pulled toward the front

Mitigations:

  • View-conditioned prompting: append a direction phrase chosen from the camera azimuth ("a photo of X, front view" / "back view"); see the sketch after this list
  • Progressive view sampling: train frontal views first, then expand to side/back
  • Debiased SDS (Hong et al. 2023): importance sampling on azimuth angles
  • Perp-Neg (Armandpour et al. 2023): apply negative prompts at opposite viewpoints
  • 3D-aware diffusion priors: replace the prior with multi-view diffusion at the root (see § 4)
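A minimal sketch of view-conditioned prompting: the sampled camera azimuth/elevation selects a coarse view phrase that is appended to the prompt (the thresholds below are illustrative, not taken from any specific paper):

```python
def view_conditioned_prompt(prompt: str, azimuth_deg: float, elevation_deg: float) -> str:
    """Append a coarse view phrase to the prompt based on the sampled camera pose."""
    azimuth = azimuth_deg % 360
    if elevation_deg > 60:
        view = "overhead view"
    elif azimuth < 45 or azimuth >= 315:
        view = "front view"
    elif 45 <= azimuth < 135 or 225 <= azimuth < 315:
        view = "side view"
    else:
        view = "back view"
    return f"{prompt}, {view}"

# view_conditioned_prompt("a photo of a lion", azimuth_deg=180, elevation_deg=10)
# -> "a photo of a lion, back view"
```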

Over-Saturation and Over-Smoothing

SDS tends to produce results that are over-saturated and over-smoothed in texture.

Cause: mode-seeking concentrates the result in the highest-density region of the prior, losing diversity. This is exactly the core issue that VSD in § 3 addresses.

High Computational Cost

Per-scene optimization takes 1.5-2 hours (original DreamFusion, about 1.5 h on a TPUv4). Even Magic3D requires 40 minutes. This drove the SDS + 3DGS acceleration in § 3.5 and the rise of feed-forward reconstruction in § 5.


3. VSD Optimization School

3.1 ProlificDreamer and VSD

ProlificDreamer (Wang et al. 2023) proposed Variational Score Distillation (VSD), fundamentally rethinking the SDS objective.

Core argument: SDS treats the 3D parameters \(\theta\) as deterministic variables to optimize (toward the highest-probability point of the prior), which is the root of mode-seeking. VSD treats \(\theta\) as a distribution \(\mu(\theta \mid y)\) to optimize — the goal is not to find a \(\theta^*\), but to find a distribution \(\mu\) over \(\theta\).

3.2 VSD Derivation

VSD's optimization objective: for every noise level \(t\) and camera \(\pi\), make the distribution of noised renders produced by \(\mu\), \(q^\mu_t(\mathbf{x}_t \mid \pi, y)\), minimize the KL divergence to the diffusion prior \(p_t(\mathbf{x}_t \mid y)\):

\[ \min_{\mu} \, \mathbb{E}_{t, \pi} \left[ \omega(t) \, D_{\text{KL}}\left(q^\mu_t(\mathbf{x}_t \mid \pi, y) \,\Vert\, p_t(\mathbf{x}_t \mid y)\right) \right] \]

Taking the Wasserstein gradient w.r.t. \(\mu\), we obtain the VSD update rule:

\[ \boxed{ \nabla_\theta \mathcal{L}_{\text{VSD}} \;=\; \mathbb{E}_{t,\pi,\epsilon} \left[ \omega(t) \big( \underbrace{\epsilon_\phi(\mathbf{x}_t; y, t)}_{\text{prior score}} - \underbrace{\epsilon_{\phi'}(\mathbf{x}_t; y, t, \pi)}_{\text{score of current } q^\mu} \big) \frac{\partial \mathbf{x}}{\partial \theta} \right] } \]

Comparison with SDS:

| | SDS | VSD |
|---|---|---|
| Subtracted "baseline" | Random noise \(\epsilon\) | Score of current particle distribution \(\epsilon_{\phi'}\) |
| Goal | Mode-seeking | Mode-coverage |
| Diversity | Low (saturated, averaged) | High |
| CFG weight | Often 100+ | 7.5 (consistent with 2D generation) |
| Training cost | Medium | High (online LoRA training) |

3.3 Estimating the Particle-Distribution Score with LoRA

VSD's key engineering implementation: use an online-trained LoRA (on top of the pretrained \(\epsilon_\phi\)) to estimate \(\epsilon_{\phi'}\).

The flow: at each VSD optimization step, use the current rendered image as a training sample for the LoRA, performing one diffusion training step:

  1. Render the image \(\mathbf{x}\) at viewpoint \(\pi\) with current \(\theta\)
  2. Add noise: \(\mathbf{x}_t = \alpha_t \mathbf{x} + \sigma_t \epsilon\)
  3. LoRA training: train \(\epsilon_{\phi'}\) for one step with \(\mathbf{x}_t \to \epsilon\)
  4. VSD update: update \(\theta\) with \(\epsilon_\phi(\mathbf{x}_t) - \epsilon_{\phi'}(\mathbf{x}_t)\)

The LoRA constantly chases the current particle distribution, and the two reach a dynamic equilibrium.
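A compact sketch of this alternating loop (all handles are hypothetical stand-ins: `render` closes over \(\theta\), `eps_phi` is the frozen prior, `eps_lora` the trainable LoRA-augmented copy additionally conditioned on the camera, as in the VSD formula):

```python
import torch

def vsd_step(render, camera, y_emb, eps_phi, eps_lora,
             theta_opt, lora_opt, alphas, sigmas, w):
    """One VSD iteration (sketch): the LoRA chases the current render distribution,
    theta is pushed by the difference of the two scores."""
    x = render(camera)                                  # render with the current theta
    t = torch.randint(20, 980, (1,))
    eps = torch.randn_like(x)
    x_t = alphas[t] * x + sigmas[t] * eps

    # (1) one diffusion-training step of the LoRA on the current render (theta detached)
    lora_loss = (eps_lora(x_t.detach(), t, y_emb, camera) - eps).pow(2).mean()
    lora_opt.zero_grad(); lora_loss.backward(); lora_opt.step()

    # (2) VSD update of theta: (prior score - particle score), pushed through the renderer
    with torch.no_grad():
        grad = w[t] * (eps_phi(x_t, t, y_emb) - eps_lora(x_t, t, y_emb, camera))
    theta_loss = (grad * x).sum()
    theta_opt.zero_grad(); theta_loss.backward(); theta_opt.step()
```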

3.4 The Effect of VSD

ProlificDreamer on text-to-3D:

  • Geometric detail significantly surpasses DreamFusion / Magic3D
  • Texture saturation is natural, no longer over-exposed
  • Repeated runs with the same prompt produce diverse results (a direct manifestation of mode-coverage)

The cost: training time increases from SDS's hour-scale to about 2-4 hours.

3.5 SDS + 3DGS Acceleration

VSD has good quality but is slow. A parallel path: keep SDS, but swap in a more efficient 3D representation.

DreamGaussian (Tang et al. 2023)

The first work to bring 3D Gaussian Splatting into SDS optimization. Two stages:

  1. Stage 1 (about 2 minutes): 3DGS + SDS, optimizing geometry from noise
  2. Stage 2 (about 1 minute): extract a mesh + UV from 3DGS, then run SDS for texture optimization

Total time 3-5 minutes, more than 20× faster than DreamFusion.

Why does 3DGS make SDS so fast?

  • 3DGS rendering is 10-100× faster than NeRF (rasterization vs. ray marching)
  • 3DGS's "adaptive density control" (clone/split/prune) makes optimization more stable
  • 3DGS's explicit representation makes "adding more capacity" trivial (just add Gaussians)

GaussianDreamer (Yi et al. 2023)

Initializes 3DGS with the output of Shap-E (OpenAI's 3D diffusion model), then runs SDS optimization. Shap-E provides a coarse geometric prior that prevents SDS from getting trapped in local optima when starting from noise.

LucidDreamer / DreamCraft3D

Further combine SDS / VSD with 3DGS / mesh, and introduce additional geometric regularizers (normal smoothness, occupancy consistency) to mitigate high-frequency artifacts.


4. Multi-View Diffusion School

The fundamental dilemma of SDS / VSD: a single 2D diffusion model does not know "what 3D is." The multi-view diffusion school takes a different approach: directly train a diffusion model whose outputs are multi-view consistent images, injecting 3D consistency at the source.

4.1 Zero-1-to-3: View Conditioning

Zero-1-to-3 (Liu et al. 2023) is the seminal work of this line. It fine-tunes Stable Diffusion into:

\[ \epsilon_\phi(\mathbf{x}_t; \, y_{\text{ref}}, \, \Delta\pi, \, t) \]

where \(y_{\text{ref}}\) is the image embedding (CLIP) of a reference view, and \(\Delta\pi = (\Delta\theta, \Delta\phi, \Delta r)\) is the relative camera pose (elevation, azimuth, distance). The model learns to generate a new-view image given a reference image and a relative camera pose.
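A sketch of how this view conditioning can be assembled. The parameterization below (polar-angle offset, sin/cos of the azimuth offset, radius offset, concatenated to the CLIP image embedding and projected back to the cross-attention dimension) follows the released Zero-1-to-3 code as I understand it, but treat the dimensions and module as illustrative:

```python
import torch
import torch.nn as nn

class ViewConditioning(nn.Module):
    """Zero-1-to-3 style conditioning token: CLIP embedding of the reference view + relative pose."""
    def __init__(self, clip_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(clip_dim + 4, clip_dim)   # [CLIP embed | pose(4)] -> cross-attn dim

    def forward(self, clip_embed, d_elev, d_azim, d_radius):
        # relative pose encoded as (delta elevation, sin/cos of delta azimuth, delta radius)
        pose = torch.stack([d_elev, torch.sin(d_azim), torch.cos(d_azim), d_radius], dim=-1)
        return self.proj(torch.cat([clip_embed, pose], dim=-1))   # [B, clip_dim] conditioning token
```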

Training data: Objaverse (about 800K 3D assets), with each asset rendered from 12 uniformly distributed views, forming triples \((\mathbf{x}_{\text{ref}}, \mathbf{x}_{\text{tgt}}, \Delta\pi)\).

Inference: given a single image, the model can "imagine" it from any new viewpoint. This was the first time Stable Diffusion was given an explicit notion of camera pose and 3D viewpoint.

But Zero-1-to-3 still samples each view independently — there is no explicit consistency constraint between views, so reconstructions stitched from these views show seams and inconsistencies. This drove the next generation of multi-view joint diffusion.

4.2 Paths to Multi-View Consistency

Cross-View Self-Attention (MVDream)

MVDream (Shi et al. 2023) feeds 4 views into the U-Net simultaneously, and concatenates the tokens of all views for self-attention:

Original SD: U-Net processes a single image [B, C, H, W]
MVDream: U-Net processes a 4-view batch [B, 4, C, H, W]
         self-attn across all spatial positions in the 4 views

Effect: the 4 views see each other in the attention layers, naturally producing consistency.
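The core change can be sketched as a reshape before self-attention so the token sequence spans all V views at once (a schematic sketch, not MVDream's actual code; `attn` is assumed to be constructed with `batch_first=True`):

```python
import torch
import torch.nn as nn

def cross_view_self_attention(x: torch.Tensor, attn: nn.MultiheadAttention) -> torch.Tensor:
    """x: U-Net features [B, V, C, H, W] for V views of the same object.
    Plain 2D self-attention flattens only H*W per image; here we flatten V*H*W so every
    spatial position attends across all views."""
    B, V, C, H, W = x.shape
    tokens = x.permute(0, 1, 3, 4, 2).reshape(B, V * H * W, C)   # [B, V*H*W, C]
    out, _ = attn(tokens, tokens, tokens)                        # joint attention over all views
    return out.reshape(B, V, H, W, C).permute(0, 1, 4, 2, 3)     # back to [B, V, C, H, W]

# e.g. attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)
```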

Epipolar Attention (SyncDreamer)

SyncDreamer (Liu et al. 2023) introduces attention constrained by epipolar geometry: each pixel attends only to points on the epipolar lines in other views. This is a stronger geometric prior, but requires known camera intrinsics and extrinsics.

Video Diffusion as 3D Prior (SV3D / CAT3D)

The latest trend: use video diffusion models as 3D priors. Video diffusion has already learned that adjacent frames must be consistent; treating a continuous camera orbit as a video lets that consistency be reused for 3D.

  • SV3D (Voleti et al. 2024): based on Stable Video Diffusion, takes a single image and outputs 21 continuous orbiting views
  • CAT3D (Gao et al. 2024): video diffusion as a strong prior for novel-view synthesis; reconstruction works even from one to three input views

Architecture Comparison

graph TD
    A[Multi-View Diffusion Consistency] --> B[Cross-view self-attn<br/>MVDream]
    A --> C[Epipolar attn<br/>SyncDreamer]
    A --> D[Video diffusion<br/>SV3D / CAT3D]

    B --> B1[Simple<br/>No camera params needed<br/>Fixed view count]
    C --> C1[Strict geometric prior<br/>Requires camera params<br/>Good generalization]
    D --> D1[Leverages video pretraining<br/>Dense views<br/>Computationally expensive]

4.3 Summary of Representative Works

| Method | Output | Consistency mechanism | Input condition |
|---|---|---|---|
| Zero-1-to-3 (2023) | Single new view | Implicit (shared weights) | Single image + relative pose |
| Zero123++ (2023) | 6 fixed views | Tiling + cross-attn | Single image |
| MVDream (2023) | 4 fixed views | Cross-view self-attn | Text |
| SyncDreamer (2023) | 16 views | Epipolar attn | Single image |
| Wonder3D (2023) | RGB + Normal × N views | Cross-domain attn | Single image |
| ImageDream (2023) | 4 views | Cross-view self-attn | Single image |
| SV3D (2024) | 21 frames of orbiting views | Video diffusion | Single image |
| CAT3D (2024) | N new views | Video diffusion | 1-N images + target poses |

4.4 Multi-View Diffusion → 3D Downstream

Multi-view diffusion alone does not directly output 3D; subsequent reconstruction is needed:

  1. Multi-view diffusion generates N consistent images
  2. Use NeRF / 3DGS / learning-based MVS to reconstruct 3D from these N images
  3. (Optional) SDS fine-tuning

The bottleneck of this pipeline is that step 2 is still slow. This drove the § 5 paradigm of directly outputting 3D in a feed-forward manner.


5. Feed-Forward Reconstruction School

The feed-forward school's goal: produce 3D in seconds to tens of seconds, with no per-scene optimization.

5.1 LRM: The Foundational Work

LRM (Large Reconstruction Model, Hong et al. 2023) is the seminal work of this line, with a very direct idea:

If transformers can predict images from text (as in DALL-E), they can also predict 3D representations from images.

Architecture:

Input: single image (256×256)
  ↓
DINO ViT encoder → image tokens
  ↓
Cross-attention transformer (10+ layers)
  ↓
Predict Triplane features (3 × 32×32×80)
  ↓
+ small MLP decoder (shared)
  ↓
NeRF rendering: volume sampling → RGB

Training: end-to-end supervision on Objaverse + MVImgNet — given a single image, directly predict a Triplane that can render multi-view ground truth. The loss is L2 + LPIPS between rendered images and GT multi-views.

Result: 3D in 5 seconds from a single image, with quality matching what early SDS would take an hour to optimize. A paradigm-shifting moment for this line.
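The Triplane readout at the bottom of this pipeline can be sketched as follows: a 3D point is projected onto the three axis-aligned feature planes, features are bilinearly sampled and summed, then a small MLP decodes density and color (a minimal sketch, not LRM's exact decoder):

```python
import torch
import torch.nn.functional as F

def query_triplane(points: torch.Tensor, planes: torch.Tensor, mlp) -> torch.Tensor:
    """points: [N, 3] in [-1, 1]^3; planes: [3, C, H, W] feature maps for the XY, XZ, YZ planes.
    Projects each point onto the three planes, bilinearly samples and sums the features,
    then decodes with a small MLP (e.g. to density + color)."""
    xy, xz, yz = points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]]
    feats = 0
    for plane, coords in zip(planes, (xy, xz, yz)):
        grid = coords.view(1, -1, 1, 2)                                  # [1, N, 1, 2] sampling grid
        sampled = F.grid_sample(plane[None], grid, align_corners=True)   # [1, C, N, 1]
        feats = feats + sampled[0, :, :, 0].T                            # accumulate [N, C]
    return mlp(feats)                                                    # e.g. [N, 4] = (sigma, rgb)
```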

5.2 LRM Descendants

InstantMesh (Xu et al. 2024)

LRM's output is NeRF (implicit), which is unfriendly to downstream applications like games / AR. InstantMesh's improvements:

  1. Stage 1: use Zero123++ to generate 6 view images
  2. Stage 2: an LRM-style transformer predicts a FlexiCubes representation (a differentiable mesh representation)
  3. Stage 3: extract a mesh from FlexiCubes and bake textures

Total time about 10 seconds, output is a standard mesh — one of the most popular open-source image-to-mesh pipelines in the community.

TripoSR (TripoAI + Stability 2024)

An open-source reproduction of LRM with a roughly 0.5B-parameter model, producing results in about 0.5 s on a single high-end GPU. Code and weights are fully public.

LGM / GRM: 3DGS-Output Variants

LRM's Triplane → NeRF rendering speed is still bounded by volume rendering. LGM (Tang et al. 2024) and GRM (Xu et al. 2024) change this to:

  • Input: 4 view images (generated by Zero123++ or MVDream)
  • Network output: one 3D Gaussian per input pixel (so 4 views × H × W Gaussians in total)
  • Direct 3DGS rendering

Effect: inference ranging from about 0.1 s (GRM) to a few seconds (LGM), rendering at 3DGS speed, quality comparable to LRM.

TriplaneGaussian (Zou et al. 2023)

A hybrid representation: the transformer outputs Triplane (used as a spatial feature), then an MLP decodes 3DGS parameters. Combines Triplane's inductive bias with 3DGS's rendering speed.

5.3 Structured-Latent School

The key turning point of late 2024: first diffuse a structured latent, then decode to specific 3D representations. This is the 3D version of the Stable Diffusion paradigm.

Trellis (Microsoft, 2024)

Trellis proposes Sparse Structured Latent (SLAT):

  1. Encoder: encodes a 3D asset into latents on a sparse voxel grid, with each token carrying both occupancy and feature
  2. Diffusion: flow matching on this sparse latent
  3. Decoder: the same latent can be decoded into 3DGS / Radiance Field / Mesh

The key innovation: single latent, multiple outputs. The same diffused latent can simultaneously serve games (mesh) and AR (3DGS).
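The sparse structured latent can be pictured as a set of active voxel coordinates paired with per-voxel feature vectors; a minimal container sketch (the data layout and `densify` helper are assumptions for illustration, not Trellis's actual classes):

```python
from dataclasses import dataclass
import torch

@dataclass
class SparseStructuredLatent:
    """Latent tokens placed only on occupied voxels of an N^3 grid (illustrative layout)."""
    coords: torch.Tensor   # [M, 3] integer voxel indices of active voxels, M << N^3
    feats: torch.Tensor    # [M, C] latent feature attached to each active voxel
    grid_size: int         # N

    def densify(self) -> torch.Tensor:
        """Scatter back to a dense [C, N, N, N] grid (zeros elsewhere), e.g. to feed a voxel decoder."""
        C = self.feats.shape[1]
        dense = torch.zeros(C, self.grid_size, self.grid_size, self.grid_size)
        x, y, z = self.coords.long().unbind(dim=1)
        dense[:, x, y, z] = self.feats.T
        return dense
```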

Hunyuan3D 1.0 / 2.0 (Tencent, 2024-2025)

Hunyuan3D 2.0 splits 3D generation into two independent diffusion processes:

  1. Shape Diffusion: diffuse on the latent of a Signed Distance Field (SDF) to generate geometry
  2. Texture Diffusion: with geometry fixed, diffuse in UV space or multi-view images to generate texture

This decoupling allows each subtask to use its most suitable representation — SDF favors geometric diffusion, multi-view images favor high-frequency texture detail. Quality ranks among the top in early-2025 image-to-3D benchmarks.

3DTopia / 3DTopia-XL

3DTopia (Hong et al. 2024) uses Triplane latent + diffusion; 3DTopia-XL (Chen et al. 2024) scales the approach up to 1B parameters, among the first 3D-native latent diffusion models at that scale.

5.4 Relationship Between Feed-Forward and Multi-View Diffusion

It is worth noting that most methods in § 5.2 are actually two-stage:

graph LR
    A[Input image/text] --> B[Multi-view diffusion<br/>Zero123++ / MVDream]
    B --> C[N consistent view images]
    C --> D[Feed-forward reconstruction<br/>LRM / LGM / InstantMesh]
    D --> E[3D representation]

In practice, multi-view diffusion in Chapter 4 and feed-forward reconstruction in Chapter 5 are used in series, not as alternatives. The structured-latent school (Trellis / Hunyuan3D) is the first set of methods to fully bypass intermediate multi-view images, going directly image → latent → 3D.


6. Sparse-View Reconstruction

§ 5 deals with single-image input. In real-world scenarios, 2-10 sparse views are common (phone photos, surveillance, street view) — this is a gray zone between "geometric dense multi-view" and "generative single-image." Sparse-view reconstruction uses diffusion priors to fill in missing viewpoints, enabling traditional NeRF / 3DGS to work with few views.

6.1 Diffusion-Prior Inpainting

ReconFusion (Wu et al. 2023)

The core idea: augment NeRF training with a diffusion prior loss.

  1. Render the current NeRF at unseen viewpoints to obtain \(\mathbf{x}_{\text{novel}}\)
  2. Use a diffusion model \(\epsilon_\phi\) to evaluate whether \(\mathbf{x}_{\text{novel}}\) "looks realistic"
  3. Add the SDS-style gradient as additional supervision in NeRF training

Effect: with only 3 views, achieves quality close to that with 50 views.
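The combined objective can be sketched as a weighted sum of the ordinary photometric loss on the observed views and an SDS-style diffusion term on rendered novel views (a schematic sketch; ReconFusion's actual prior is image-conditioned and its loss details differ, and all handles below are hypothetical):

```python
import torch

def sparse_view_loss(render, observed_cams, observed_imgs, novel_cams,
                     diffusion_grad, lambda_prior: float = 0.1) -> torch.Tensor:
    """render(cam) -> image, differentiable w.r.t. the NeRF/3DGS parameters.
    diffusion_grad(img) -> SDS-style gradient w.r.t. that image (treated as constant)."""
    # (1) reconstruction loss on the few observed views
    recon = sum(((render(c) - gt) ** 2).mean() for c, gt in zip(observed_cams, observed_imgs))

    # (2) diffusion-prior term on novel-view renders (gradient injected via the surrogate-loss trick)
    prior = 0.0
    for c in novel_cams:
        x = render(c)
        prior = prior + (diffusion_grad(x).detach() * x).sum()

    return recon + lambda_prior * prior
```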

CAT3D (Gao et al. 2024)

A more cutting-edge idea: use video diffusion to generate hundreds of target viewpoints in one shot, then have NeRF / 3DGS train on these "hallucinated viewpoints":

Input: 1-3 images + hundreds of target camera trajectories
  ↓
Video diffusion (cross-view + cross-time attn)
  ↓
Output: hundreds of dense view images (consistency guaranteed by video model)
  ↓
Standard 3DGS training
  ↓
Reconstructed scene

CAT3D converts the "sparse-view" problem into "first generate dense views, then use classical geometric methods," achieving quality close to ground truth even in the 1-image setting.

ZeroNVS (Sargent et al. 2023)

Extends Zero-1-to-3 to scene level (not just objects). Trained on large-scale scene datasets (CO3D + RealEstate10K + ACID), it handles arbitrary indoor/outdoor scenes.

6.2 Feed-Forward Sparse-View → 3DGS

Another path: given 2-3 views, feed-forward directly to 3DGS, fully skipping per-scene optimization.

pixelSplat (Charatan et al. 2023)

Takes two images, transformer directly predicts 3D Gaussian parameters at each pixel (positions estimated from depth via epipolar constraints, other parameters predicted by transformer). 100 ms inference.

MVSplat (Chen et al. 2024)

A multi-view extension of pixelSplat, introducing cost volumes for depth estimation. SOTA on sparse-view novel view synthesis in 2024.

Splatter Image (Szymanowicz et al. 2023)

The simplest: single image → U-Net → "splatter image" (one Gaussian per pixel) → 3DGS. 1 s inference, for object-level 3D asset generation.

6.3 Differences from the LRM Family

| Dimension | LRM family (§ 5) | pixelSplat / MVSplat (§ 6.2) |
|---|---|---|
| Input | Single image (+ multi-view diffusion as auxiliary) | 2-3 sparse views (with camera parameters) |
| Training data | Object-level Objaverse | Scene-level RealEstate10K / ACID |
| Use case | 3D asset generation | Scene novel view synthesis |

It is worth noting: World Labs's Marble reportedly uses scene-level sparse-view reconstruction techniques similar to CAT3D / MVSplat (see § 9 Boundaries Discussion).


7. Datasets and Evaluation

7.1 Dataset Ecosystem

Objaverse Series: The Cornerstone of Object-Level 3D Data

  • Objaverse 1.0 (Deitke et al. 2022): 800K object-level 3D assets, scraped from Sketchfab
  • Objaverse-XL (Deitke et al. 2023): 10M 3D assets, expanded to GitHub, Thingiverse, Smithsonian, etc.

Almost all image-to-3D / text-to-3D training uses Objaverse rendered pairs (each asset rendered from 12-32 views).

Data Quality Issues

A substantial fraction of Objaverse assets are watermarks / 2D stickers / low-quality models. Strict filtering (aesthetic score, geometric quality, CLIP score thresholds) is typically performed before training, leaving roughly 30-50% as usable.

Scene-Level Datasets

| Dataset | Content | Scale | Use |
|---|---|---|---|
| RealEstate10K | YouTube real-estate videos | 80K videos | Scene-level NVS |
| ACID | YouTube aerial footage | 13K videos | Outdoor scenes |
| CO3D (Reizenstein 2021) | Real objects | 19K objects, 1.5M frames | Object-level NVS |
| MVImgNet (Yu 2023) | Real objects | 220K objects | LRM training augmentation |

Evaluation Benchmarks

  • GSO (Google Scanned Objects): 1000 real scanned objects, the standard benchmark for image-to-3D
  • DTU / NeRF Synthetic: traditional NVS benchmarks, often used in sparse-view settings

7.2 Evaluation Metrics

Geometric Quality

  • Chamfer Distance (CD): bidirectional nearest-neighbor distance averaged over point clouds, lower is better (see the sketch after this list) $$ \text{CD}(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \|p - q\|_2^2 + \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} \|q - p\|_2^2 $$
  • F-score: harmonic mean of precision/recall at a fixed threshold \(\tau\)
  • IoU (Intersection over Union): voxel-level overlap
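Chamfer Distance as defined above can be computed directly from pairwise distances (a brute-force O(|P||Q|) sketch; real evaluations usually use a KD-tree or sampled point subsets):

```python
import torch

def chamfer_distance(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """P: [n, 3], Q: [m, 3] point clouds sampled from the two surfaces.
    Returns the symmetric squared Chamfer Distance of the formula above."""
    d2 = torch.cdist(P, Q) ** 2                      # [n, m] pairwise squared Euclidean distances
    return d2.min(dim=1).values.mean() + d2.min(dim=0).values.mean()
```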

Visual Quality

  • LPIPS (Zhang 2018): perceptual similarity
  • PSNR / SSIM: traditional pixel-level metrics (less informative for generative 3D, where outputs need not be pixel-aligned with any ground truth)
  • CLIP score: CLIP embedding cosine similarity between rendered images and prompts (only applicable to text-to-3D)
  • FID: distance between rendered distribution and real distribution

Multi-View Consistency

  • MV-CLIP: CLIP embedding variance across rendered views; lower is more consistent
  • Reprojection error: depth-based geometric consistency
  • User studies: multi-view failure modes (Janus) are hard to detect with automated metrics; user studies remain necessary to this day

7.3 The Janus Problem in Depth

The Janus problem is often overlooked in evaluation — a result with high CLIP score may still be a multi-faced monster.

Quantitative Detection

  • Render four views (front, back, left, right) and compute pairwise CLIP similarity (see the sketch after this list)
  • If all four views have high prompt similarity and high mutual similarity (suggesting they are nearly the same image) → likely Janus
  • Combined with human-annotated Janus / non-Janus ground truth, train a classifier
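A sketch of the four-view CLIP check described above (thresholds and the helper name are illustrative; any CLIP image/text encoder with L2-normalized embeddings works):

```python
import torch

def janus_check(view_embeds: torch.Tensor, prompt_embed: torch.Tensor) -> dict:
    """view_embeds: [4, D] L2-normalized CLIP embeddings of front/back/left/right renders.
    prompt_embed: [D] L2-normalized CLIP text embedding of the prompt.
    High prompt similarity for all views AND high pairwise view similarity is a Janus red flag."""
    prompt_sim = view_embeds @ prompt_embed              # [4] each view vs. the prompt
    pair = view_embeds @ view_embeds.T                   # [4, 4] view vs. view
    off_diag = pair[~torch.eye(4, dtype=torch.bool)]     # drop self-similarities
    return {"min_prompt_sim": prompt_sim.min().item(),   # do all views match the prompt?
            "mean_view_sim": off_diag.mean().item()}     # close to 1.0 -> views look identical
```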

Effectiveness of Mitigations

| Solution | Janus mitigation | Compute cost | Applicable scope |
|---|---|---|---|
| View-conditioned prompt | Medium | 0 | Any SDS method |
| Progressive view sampling | Medium | Low | SDS optimization |
| Debiased SDS / Perp-Neg | Medium-High | Low | SDS optimization |
| Multi-view diffusion priors (MVDream, etc.) | High | High (retraining the prior) | All downstream |
| Feed-forward reconstruction (LRM family) | High (supervised learning bypasses it at the root) | High (one-time training) | Object-level |

The trend is moving from "post-hoc patches" (view prompts) to "consistency built into the prior" (MVDream / LRM).


8. Method Panorama

8.1 Comprehensive Comparison Table

| Method | Year | School | Input | Representation | Training cost | Inference time | Janus | Open-source |
|---|---|---|---|---|---|---|---|---|
| DreamFusion | 2022.09 | SDS | text | NeRF | 0 | 1.5 h | Severe | ❌ (Imagen closed) |
| SJC | 2022.12 | SDS | text | NeRF | 0 | 1 h | Severe | |
| Magic3D | 2023.02 | SDS two-stage | text | NeRF→Mesh | 0 | 40 min | Severe | |
| Latent-NeRF | 2022.11 | SDS | text | latent NeRF | 0 | 30 min | Severe | |
| Fantasia3D | 2023.03 | SDS decoupled | text | DMTet | 0 | 1 h | Medium | |
| Zero-1-to-3 | 2023.03 | Multi-view diffusion | image | NA | $$$ | 30 s | Medium | |
| ProlificDreamer | 2023.05 | VSD | text | NeRF | 0 (online LoRA) | 2-4 h | Medium | |
| MVDream | 2023.08 | Multi-view diffusion + SDS | text | NeRF | $$$ | 30 min | Mild | |
| SyncDreamer | 2023.09 | Multi-view diffusion | image | NeRF | $$$ | 5 min | Mild | |
| DreamGaussian | 2023.09 | SDS + 3DGS | image/text | 3DGS→Mesh | 0 | 3-5 min | Medium | |
| Wonder3D | 2023.10 | RGB+Normal MV diffusion | image | Mesh | $$$ | 3 min | Mild | |
| LRM | 2023.11 | Feed-forward | image | Triplane→NeRF | $$$$ | 5 s | Mild | Partial |
| ReconFusion | 2023.12 | Sparse-view + diffusion | 3 images | NeRF | $$$ | 10 min | NA | |
| Zero123++ | 2023.10 | Multi-view diffusion | image | NA | $$$ | 30 s | Mild | |
| InstantMesh | 2024.03 | MV diffusion + feed-forward | image | Mesh | $$$$ | 10 s | Mild | |
| TripoSR | 2024.03 | Feed-forward | image | Triplane→NeRF | $$$$ | 0.5 s | Mild | |
| LGM | 2024.02 | MV diffusion + feed-forward | image | 3DGS | $$$$ | 5 s | Mild | |
| GRM | 2024.03 | MV diffusion + feed-forward | image | 3DGS | $$$$ | 0.1 s | Mild | |
| pixelSplat | 2023.12 | Sparse-view feed-forward | 2 images | 3DGS | $$$$ | 0.1 s | NA | |
| MVSplat | 2024.03 | Sparse-view feed-forward | 2-N images | 3DGS | $$$$ | 0.1 s | NA | |
| SV3D | 2024.03 | Video diffusion prior | image | NA | $$$$ | 5 min | Mild | Partial |
| CAT3D | 2024.05 | Video diffusion + reconstruction | 1-N images | NeRF/3DGS | $$$$ | 1 min | Mild | |
| Trellis | 2024.12 | Structured latent | image/text | SLAT→multiple | $$$$$ | 30 s | Mild | |
| Hunyuan3D 2.0 | 2025.01 | Shape + texture dual diffusion | image | Mesh | $$$$$ | 1 min | Mild | |

Note: the $ symbols in the training-cost column denote the cost of training the prior (per-scene optimization itself costs 0); $$$$ is on the order of 10K-100K GPU-hours.

8.2 Trend Evolution

graph LR
    Era1[2022-2023 H1<br/>SDS optimization<br/>per-scene slow] --> Era2[2023 H2<br/>Multi-view diffusion<br/>built-in consistency]
    Era2 --> Era3[Late 2023-2024<br/>Feed-forward<br/>seconds to 3D]
    Era3 --> Era4[2024-2025<br/>Structured latent<br/>unified framework]

Core patterns:

  1. From optimization to learning: per-scene optimization (hours) → one-time training + feed-forward inference (seconds)
  2. From implicit to structured: pure SDS (no 3D prior) → multi-view diffusion (implicit 3D prior) → Triplane / SLAT (explicit structured prior)
  3. From single representation to multi-output: early methods bind to one 3D representation (NeRF), now latent diffusion can uniformly decode to mesh / 3DGS / NeRF
  4. From objects to scenes: early methods could only do object-level (Objaverse training), now CAT3D / ZeroNVS have extended to scene-level

8.3 Selection Guide

| Your need | Recommended method |
|---|---|
| Text-to-3D, quality first, can wait | ProlificDreamer (VSD) |
| Text-to-3D, speed first | DreamGaussian / MVDream + DreamFusion |
| Image-to-3D, engineering deployment | InstantMesh / TripoSR / Trellis |
| Image-to-3D, strongest open-source SOTA | Hunyuan3D 2.0 |
| Sparse views (2-3 images) → 3D | MVSplat / CAT3D |
| Single image → navigable scene | CAT3D / ZeroNVS |
| Researching SDS improvements | DreamFusion + multi-view diffusion prior (e.g., MVDream) |
| Researching new representations | Trellis (structured latent is the dominant direction in 2024-2025) |

9. Boundaries with Adjacent Topics

Generative 3D is technically related to several adjacent directions but differs in positioning. Clear boundaries help navigation between notes.

9.1 With Geometric 3D Reconstruction

See 3D视觉.md.

| Dimension | Generative 3D (this article) | Geometric 3D (3D视觉.md) |
|---|---|---|
| Data assumption | Single image / text / few views | Dense multi-view (dozens of calibrated images) |
| Core mechanism | 2D diffusion priors + rendering | Geometric constraints + volume rendering |
| Doesn't need | Camera calibration (in most cases) | — |
| Must have | Large-scale 2D training data | Many views + calibration |
| Representative methods | DreamFusion / LRM / Trellis | NeRF / 3DGS / SfM-MVS |

Technical overlap: 3D Gaussian Splatting appears on both sides — the geometric school treats it as a reconstruction result, the generative school treats it as an output representation.

9.2 With Diffusion Methodology

See Diffusion.md.

The core methods of generative 3D (SDS / VSD / multi-view diffusion) are all built on the DDPM / DDIM / Score-Based framework. This article discusses only the application layer; for methodology, return to Diffusion.md.

Particularly relevant sections in Diffusion.md:

  • § Score-Based View: key to understanding SDS as score distillation
  • § Classifier-Free Guidance: choice of CFG weight in SDS practice
  • § Latent Diffusion: understanding the latent path of Latent-NeRF / Trellis

9.3 With World Models

See 世界模型与视频生成.md (embodied perspective) and 空间智能与学习式仿真.md (path-of-AGI perspective).

Key distinctions:

  • Generative 3D outputs a static 3D representation, with no time dimension and no action input
  • World models predict \(f(s_t, a_t) \to s_{t+1}\), fundamentally a temporal model
  • The two dimensions are orthogonal: generative 3D is "the inverse problem in spatial dimensions," and world models are "forward prediction in the temporal dimension"

9.4 With "Spatial Intelligence" as a Path

See 空间智能与学习式仿真.md.

Fei-Fei Li's World Labs / Marble — "persistent navigable 3D worlds" — engineering-wise uses § 6 sparse-view reconstruction (CAT3D / MVSplat-class techniques) + § 5 large-scale feed-forward reconstruction in this article. But its positioning is an argument about an AGI path — "3D is closer to human cognition than 2D / language" — rather than a specific method. The two articles are complementary:

  • This article: methodology handbook (how to do 3D generation/reconstruction with diffusion)
  • Spatial Intelligence: path-of-AGI / school comparison (why spatial intelligence matters)
  • World Models and Video Generation: how to use it in robotics (how to use for embodied AI)

9.5 With Generative Foundation Models

See Generative_Foundation_Model.md § Text-to-3D.

That note is an overview of multimodal generative foundation models (with one section each for T2I / T2V / T2A / T2-3D). Its § Text-to-3D is a condensed (~20-line) version of this article, suitable for quick reference. This article is the full development of that direction.


References

Key papers in chronological order:

SDS / VSD Optimization

  1. Poole et al., "DreamFusion: Text-to-3D using 2D Diffusion," ICLR 2023.
  2. Wang et al., "Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation," CVPR 2023.
  3. Metzer et al., "Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures," CVPR 2023.
  4. Lin et al., "Magic3D: High-Resolution Text-to-3D Content Creation," CVPR 2023.
  5. Chen et al., "Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation," ICCV 2023.
  6. Wang et al., "ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation," NeurIPS 2023.
  7. Tang et al., "DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation," ICLR 2024.
  8. Yi et al., "GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models," CVPR 2024.

Multi-View Diffusion

  1. Liu et al., "Zero-1-to-3: Zero-shot One Image to 3D Object," ICCV 2023.
  2. Shi et al., "MVDream: Multi-view Diffusion for 3D Generation," ICLR 2024.
  3. Liu et al., "SyncDreamer: Generating Multiview-consistent Images from a Single-view Image," ICLR 2024.
  4. Long et al., "Wonder3D: Single Image to 3D using Cross-Domain Diffusion," CVPR 2024.
  5. Shi et al., "Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model," 2023.
  6. Voleti et al., "SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion," ECCV 2024.

Feed-Forward Reconstruction

  1. Hong et al., "LRM: Large Reconstruction Model for Single Image to 3D," ICLR 2024.
  2. Xu et al., "InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models," 2024.
  3. TripoSR (Stability AI + Tripo, 2024). Technical report.
  4. Tang et al., "LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation," ECCV 2024.
  5. Xu et al., "GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation," ECCV 2024.
  6. Zou et al., "TriplaneGaussian: Fast and Generalizable Single-View 3D Reconstruction with Transformers," CVPR 2024.

Structured-Latent School

  1. Xiang et al., "Structured 3D Latents for Scalable and Versatile 3D Generation (Trellis)," 2024.
  2. Hunyuan3D Team, "Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation," 2025.
  3. Hong et al., "3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors," 2024.

Sparse-View Reconstruction

  1. Wu et al., "ReconFusion: 3D Reconstruction with Diffusion Priors," CVPR 2024.
  2. Gao et al., "CAT3D: Create Anything in 3D with Multi-View Diffusion Models," NeurIPS 2024.
  3. Sargent et al., "ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image," CVPR 2024.
  4. Charatan et al., "pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction," CVPR 2024.
  5. Chen et al., "MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images," ECCV 2024.
  6. Szymanowicz et al., "Splatter Image: Ultra-Fast Single-View 3D Reconstruction," CVPR 2024.

Datasets

  1. Deitke et al., "Objaverse: A Universe of Annotated 3D Objects," CVPR 2023.
  2. Deitke et al., "Objaverse-XL: A Universe of 10M+ 3D Objects," NeurIPS 2023.
  3. Reizenstein et al., "Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction," ICCV 2021.
