
3D Vision

Overview

3D vision aims to understand and reconstruct the three-dimensional world from 2D images or other sensor data. In recent years, Neural Radiance Fields (NeRF) and 3D Gaussian Splatting have revolutionized novel view synthesis, while point cloud processing and monocular depth estimation provide core capabilities for autonomous driving, robotics, and AR/VR.

```mermaid
graph TD
    A[3D Vision] --> B[Novel View Synthesis]
    A --> C[3D Representation Learning]
    A --> D[Depth Estimation]
    A --> E[3D Object Detection]

    B --> B1[NeRF]
    B --> B2[3D Gaussian Splatting]
    B --> B3[Instant-NGP]

    C --> C1[PointNet/PointNet++]
    C --> C2[Voxel Methods]
    C --> C3[Mesh Methods]

    D --> D1[Monocular Depth]
    D --> D2[Stereo Matching]
    D --> D3[Multi-View Stereo]

    E --> E1[Indoor 3D Detection]
    E --> E2[Autonomous Driving 3D Detection]
```

1. 3D Representations

1.1 Common 3D Data Representations

| Representation | Description | Pros | Cons |
| --- | --- | --- | --- |
| Point Cloud | Unordered set of 3D points \((x,y,z)\) | Lightweight, flexible | No topology |
| Voxel | 3D grid, analogous to 3D pixels | Regular structure, easy convolution | High memory \(O(n^3)\) |
| Mesh | Vertices + faces | Precise surface, rendering-friendly | Topology changes difficult |
| Implicit | Continuous function \(f(x,y,z) \to\) attributes | Arbitrary resolution | Requires sampling |
| Gaussian | Collection of 3D Gaussian ellipsoids | Fast rendering, differentiable | Large memory |

1.2 Coordinate Systems and Transformations

The camera model supplies the core mathematics linking 3D world coordinates to 2D pixel coordinates:

Pinhole Model:

\[ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K [R | t] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \]

where \(K\) is the intrinsic matrix and \([R|t]\) is the extrinsic (rotation + translation).

Intrinsic Matrix:

\[ K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \]
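
To make the projection concrete, here is a minimal NumPy sketch of the pinhole model above; the camera values (focal length 500 px, principal point at the image center of a 640×480 sensor) are illustrative assumptions, not from any particular dataset.

```python
import numpy as np

def project_points(X_world, K, R, t):
    """Project (N, 3) world points to pixel coordinates with the pinhole model."""
    X_cam = X_world @ R.T + t            # world -> camera frame via [R | t]
    uv_h = X_cam @ K.T                   # apply intrinsics: s [u, v, 1]^T
    return uv_h[:, :2] / uv_h[:, 2:3]    # perspective divide by s = Z_cam

# Illustrative 640x480 camera: f_x = f_y = 500 px, principal point at center
K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])
R, t = np.eye(3), np.zeros(3)
print(project_points(np.array([[0.1, -0.2, 2.0]]), K, R, t))  # [[345. 190.]]
```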

2. Neural Radiance Fields (NeRF)

2.1 Core Idea

NeRF (Mildenhall et al., 2020) uses an MLP to implicitly represent a 3D scene:

\[ F_\Theta: (\mathbf{x}, \mathbf{d}) \to (\mathbf{c}, \sigma) \]
  • \(\mathbf{x} = (x, y, z)\): 3D spatial position
  • \(\mathbf{d} = (\theta, \phi)\): viewing direction
  • \(\mathbf{c} = (r, g, b)\): color
  • \(\sigma\): volume density

2.2 Volume Rendering

Integrating along ray \(\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}\):

\[ C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \sigma(\mathbf{r}(t)) \mathbf{c}(\mathbf{r}(t), \mathbf{d}) \, dt \]

where the transmittance is:

\[ T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s)) \, ds\right) \]

Discrete Approximation (stratified sampling in practice):

\[ \hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i (1 - \exp(-\sigma_i \delta_i)) \mathbf{c}_i \]
\[ T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \]

where \(\delta_i = t_{i+1} - t_i\) is the distance between adjacent sample points.
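
The discrete estimator translates almost line-for-line into code. Below is a minimal PyTorch sketch; the tensor shapes (rays × samples) are assumptions chosen for clarity, and the trailing \(\delta\) is padded with a large value as is common practice.

```python
import torch

def render_rays(sigma, rgb, t_vals):
    """Discrete volume rendering of NeRF outputs along each ray.

    sigma: (n_rays, n_samples) densities, rgb: (n_rays, n_samples, 3) colors,
    t_vals: (n_rays, n_samples) sorted sample positions along the rays.
    """
    delta = t_vals[:, 1:] - t_vals[:, :-1]                        # delta_i = t_{i+1} - t_i
    delta = torch.cat([delta, torch.full_like(delta[:, :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * delta)                       # 1 - exp(-sigma_i delta_i)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)            # running transmittance
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)  # T_i
    weights = trans * alpha
    return (weights.unsqueeze(-1) * rgb).sum(dim=-2)              # \hat{C}(r): (n_rays, 3)
```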

2.3 Positional Encoding

MLPs are biased toward low-frequency functions (spectral bias) and struggle to fit high-frequency detail, so NeRF maps low-dimensional inputs to a higher-dimensional space with a positional encoding:

\[ \gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right) \]

For position \(\mathbf{x}\): \(L=10\) (60 dimensions); for direction \(\mathbf{d}\): \(L=4\) (24 dimensions).
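
A minimal PyTorch sketch of \(\gamma\); the sin/cos terms are grouped rather than interleaved as in the equation, which changes nothing downstream.

```python
import torch

def positional_encoding(p, L):
    """gamma(p): encode each coordinate with sin/cos at frequencies 2^0..2^{L-1} (times pi)."""
    freqs = 2.0 ** torch.arange(L) * torch.pi       # 2^0 pi, ..., 2^{L-1} pi
    angles = p.unsqueeze(-1) * freqs                # (..., dim, L)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(-2)                          # (..., dim * 2L)

x = torch.rand(1024, 3)
print(positional_encoding(x, 10).shape)  # torch.Size([1024, 60]) -- matches L = 10
```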

2.4 Training Pipeline

  1. Sample rays from images with known camera poses
  2. Perform stratified sampling along rays (coarse + fine)
  3. MLP predicts \((\mathbf{c}, \sigma)\) for each sample point
  4. Volume rendering produces pixel color
  5. Compute MSE loss against ground truth pixels
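
Step 2 is compact enough to sketch directly. A minimal stratified sampler, under assumed near/far bounds: each ray's \([t_n, t_f]\) range is cut into even bins and one sample is drawn uniformly inside each bin.

```python
import torch

def stratified_samples(near, far, n_rays, n_samples):
    """Cut [t_n, t_f] into even bins and draw one uniform sample per bin."""
    bins = torch.linspace(near, far, n_samples + 1)
    lower, upper = bins[:-1], bins[1:]
    u = torch.rand(n_rays, n_samples)               # independent jitter per ray and bin
    return lower + u * (upper - lower)              # (n_rays, n_samples)
```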

2.5 Limitations and Improvements

| Method | Improvement | Key Innovation |
| --- | --- | --- |
| Instant-NGP | Speed | Multi-resolution hash encoding; training drops from hours to seconds |
| Mip-NeRF | Anti-aliasing | Cone tracing replaces ray tracing; integrated positional encoding |
| Mip-NeRF 360 | Unbounded scenes | Space contraction, regularization |
| TensoRF | Efficiency | Tensor decomposition of the radiance field |
| Zip-NeRF | Combined | Merges Instant-NGP and Mip-NeRF 360 |
| NeRF in the Wild | Appearance variation | Handles lighting changes and transient objects |

3. 3D Gaussian Splatting

3.1 Core Idea

3DGS (Kerbl et al., 2023) uses millions of 3D Gaussian ellipsoids to explicitly represent scenes, achieving real-time rendering through differentiable rasterization.

Each 3D Gaussian is defined by:

  • Position \(\boldsymbol{\mu} \in \mathbb{R}^3\): Gaussian center
  • Covariance \(\boldsymbol{\Sigma} \in \mathbb{R}^{3 \times 3}\): shape and orientation
  • Opacity \(\alpha \in [0,1]\)
  • Color: Spherical harmonics (SH) coefficients

Gaussian function:

\[ G(\mathbf{x}) = \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right) \]
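
Directly optimizing a raw \(3 \times 3\) covariance would not stay positive semi-definite, so 3DGS factors it as \(\boldsymbol{\Sigma} = R S S^\top R^\top\) with a rotation (stored as a quaternion) and per-axis scales. A minimal PyTorch sketch of that parameterization:

```python
import torch

def covariance_from_params(quat, scale):
    """Sigma = R S S^T R^T from a quaternion (w, x, y, z) and per-axis scales."""
    w, x, y, z = torch.nn.functional.normalize(quat, dim=0)
    R = torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])
    M = R @ torch.diag(scale)
    return M @ M.T                                   # symmetric positive semi-definite

# An axis-aligned ellipsoid stretched along z:
Sigma = covariance_from_params(torch.tensor([1., 0., 0., 0.]),
                               torch.tensor([0.1, 0.1, 0.5]))
```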

3.2 Differentiable Rasterization

Projecting 3D Gaussians to the 2D image plane:

\[ \boldsymbol{\Sigma}' = J W \boldsymbol{\Sigma} W^\top J^\top \]

where \(W\) is the view transformation matrix and \(J\) is the Jacobian of the projection.

Alpha Compositing (front-to-back ordering):

\[ C = \sum_{i \in \mathcal{N}} \mathbf{c}_i \alpha_i \prod_{j=1}^{i-1}(1 - \alpha_j) \]
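
For the Gaussians overlapping a single pixel, the compositing sum reduces to a cumulative product over depth-sorted alphas. A minimal PyTorch sketch, with shapes assumed for one pixel:

```python
import torch

def composite_pixel(colors, alphas):
    """Front-to-back alpha compositing for one pixel.

    colors: (N, 3), alphas: (N,), both sorted front to back by Gaussian depth.
    """
    trans = torch.cumprod(1.0 - alphas, dim=0)       # prod_{j<=i} (1 - alpha_j)
    trans = torch.cat([torch.ones(1), trans[:-1]])   # shift to prod_{j<i}
    return ((alphas * trans).unsqueeze(-1) * colors).sum(dim=0)   # (3,)
```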

3.3 Adaptive Density Control

Dynamically adjusting the number of Gaussians during training:

  • Clone: large view-space gradient with small scale indicates under-reconstruction → duplicate the Gaussian
  • Split: large gradient with large scale indicates over-reconstruction → split into two smaller Gaussians
  • Prune: remove nearly transparent Gaussians (\(\alpha \approx 0\))

3.4 NeRF vs 3DGS

| Feature | NeRF | 3D Gaussian Splatting |
| --- | --- | --- |
| Representation | Implicit (MLP) | Explicit (Gaussian ellipsoids) |
| Rendering | Volume rendering (ray marching) | Rasterization (splatting) |
| Training speed | Slow (hours) | Fast (minutes) |
| Rendering speed | Slow (seconds per frame) | Real-time (>100 FPS) |
| Memory | Small (network weights) | Large (millions of Gaussians) |
| Editability | Difficult | Easy (direct Gaussian manipulation) |
| Anti-aliasing | Requires Mip-NeRF | Naturally smooth |

4. Point Cloud Processing

4.1 PointNet

Core Challenge: Point clouds are unordered sets; the network must be permutation invariant.

PointNet (Qi et al., 2017) solution:

\[ f(\{x_1, \ldots, x_n\}) = g(h(x_1), \ldots, h(x_n)) \]
  • \(h\): per-point MLP (shared weights)
  • \(g\): symmetric function (max pooling)

Architecture:

  1. Input transform (T-Net): learn \(3 \times 3\) transformation matrix
  2. Per-point feature extraction: MLP \((3 \to 64 \to 128 \to 1024)\)
  3. Global feature: Max Pooling
  4. Classification/segmentation heads
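
Putting the formula and architecture together, a minimal PointNet classifier is just a shared per-point MLP followed by max pooling. The sketch below omits the T-Nets for brevity, and `n_classes=40` is an assumption (ModelNet40-style):

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet classifier: shared per-point MLP h, symmetric max-pool g."""

    def __init__(self, n_classes=40):
        super().__init__()
        self.h = nn.Sequential(                      # h: weights shared across points
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 1024),
        )
        self.head = nn.Linear(1024, n_classes)

    def forward(self, pts):                          # pts: (batch, n_points, 3)
        feats = self.h(pts)                          # (batch, n_points, 1024)
        global_feat = feats.max(dim=1).values        # g: order-independent pooling
        return self.head(global_feat)

logits = TinyPointNet()(torch.rand(8, 2048, 3))      # permuting the 2048 points
                                                     # leaves the logits unchanged
```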

4.2 PointNet++

Improvement: PointNet lacks local structure information.

Hierarchical Point Set Learning:

  1. Sampling Layer: Farthest Point Sampling (FPS) selects centroids
  2. Grouping Layer: Ball Query finds neighborhood points
  3. Feature Extraction: Apply PointNet to each local region
\[ \text{PointNet++} = \text{FPS} + \text{Ball Query} + \text{PointNet}(\text{local}) \]
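
A minimal sketch of the FPS step, which greedily picks the point farthest from all centroids chosen so far (the O(Nk) loop is the textbook version, not an optimized implementation):

```python
import torch

def farthest_point_sampling(pts, k):
    """Greedy FPS over an (N, 3) point cloud; returns indices of k centroids."""
    n = pts.shape[0]
    idx = torch.zeros(k, dtype=torch.long)
    min_dist = torch.full((n,), float("inf"))
    idx[0] = torch.randint(0, n, (1,))               # arbitrary seed point
    for i in range(1, k):
        d = ((pts - pts[idx[i - 1]]) ** 2).sum(dim=-1)
        min_dist = torch.minimum(min_dist, d)        # distance to nearest chosen centroid
        idx[i] = min_dist.argmax()                   # pick the farthest point
    return idx
```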

4.3 Subsequent Developments

| Method | Innovation |
| --- | --- |
| DGCNN | Dynamic graph convolution, EdgeConv |
| Point Transformer | Self-attention for point clouds |
| PCT | Point Cloud Transformer |
| PointNeXt | Revisiting the importance of training strategies |

5. Monocular Depth Estimation

5.1 Problem Definition

Predicting per-pixel depth from a single RGB image is an ill-posed problem due to scale ambiguity.

5.2 MiDaS

MiDaS (Ranftl et al., 2020):

  • Mixed training on multiple datasets
  • Uses scale- and shift-invariant loss:
\[ \mathcal{L} = \frac{1}{M} \sum_{i=1}^{M} \rho\left(\hat{d}_i^* - d_i^*\right) \]

where \(\hat{d}^*\) and \(d^*\) are the normalized predicted and ground truth depths respectively.
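
One common instantiation of the alignment is MiDaS's median/MAD normalization with \(\rho = |\cdot|\) (the paper also uses a trimmed variant); a minimal PyTorch sketch:

```python
import torch

def ssi_normalize(d):
    """Remove scale and shift: subtract the median, divide by mean absolute deviation."""
    t = d.median()
    s = (d - t).abs().mean().clamp(min=1e-8)
    return (d - t) / s

def ssi_loss(pred, target):
    """Scale- and shift-invariant loss with rho = |.|."""
    return (ssi_normalize(pred) - ssi_normalize(target)).abs().mean()
```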

5.3 Depth Anything

Depth Anything (Yang et al., 2024):

  • Strong visual encoder based on DINOv2
  • Self-training on large-scale unlabeled data
  • V2 uses synthetic data for metric depth

Key Design:

  1. Train teacher model on labeled data
  2. Teacher generates pseudo-labels for 62M unlabeled images
  3. Student trains on labeled + pseudo-labeled data
  4. Strong data augmentation prevents student shortcuts
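
Schematically, one pseudo-labeling step (steps 2–4) looks as follows; `teacher`, `student`, and `augment` are hypothetical stand-ins for the paper's DINOv2-based models and strong perturbations:

```python
import torch

def self_training_step(teacher, student, optimizer, unlabeled_imgs, augment):
    """One pseudo-labeling step.

    `teacher` and `student` are depth networks (image -> depth map) and
    `augment` is a strong, label-preserving perturbation (e.g. color jitter);
    all three are hypothetical stand-ins for the paper's components.
    """
    with torch.no_grad():
        pseudo_depth = teacher(unlabeled_imgs)       # teacher labels clean images
    pred = student(augment(unlabeled_imgs))          # student sees hard views
    loss = (pred - pseudo_depth).abs().mean()        # simple L1 to pseudo-labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```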

5.4 Metric Depth vs Relative Depth

| Type | Output | Applications | Representative Methods |
| --- | --- | --- | --- |
| Relative depth | Ordinal relations | Image editing, refocusing | MiDaS, Depth Anything |
| Metric depth | Absolute metric depth | Robotics, autonomous driving | ZoeDepth, Metric3D |

6. Frontiers in 3D Vision

6.1 Large-Scale 3D Reconstruction

  • DUSt3R / MASt3R: 3D reconstruction without camera poses
  • LRM: Large Reconstruction Model, single image to 3D
  • One-2-3-45++: Single image to 3D mesh

6.2 4D Scenes (Dynamic 3D)

  • Dynamic 3DGS: Gaussian splatting with temporal dimension
  • 4D Gaussian Splatting: Spatiotemporal Gaussian representation

6.3 Generative 3D

  • Zero-1-to-3: Zero-shot novel view synthesis
  • DreamFusion: Text-to-3D via Score Distillation Sampling (SDS), whose gradient is sketched after this list:
\[ \nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t,\epsilon}\left[w(t)\left(\hat{\epsilon}_\phi(z_t; y, t) - \epsilon\right) \frac{\partial g(\theta)}{\partial \theta}\right] \]
  • Magic3D: Two-stage text-to-3D (coarse-to-fine)
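
A schematic SDS step, under assumed DDPM-style conventions: noise the rendered image, query a frozen 2D denoiser, and use the residual as a gradient on the render. `denoiser` and `alphas_cumprod` are hypothetical stand-ins, not DreamFusion's actual interfaces:

```python
import torch

def sds_grad(rendered, denoiser, y, alphas_cumprod):
    """Schematic SDS update direction for a rendered image g(theta).

    `denoiser(z_t, y, t) -> eps_hat` is a frozen text-conditioned diffusion
    model and `alphas_cumprod` a DDPM noise schedule -- both hypothetical.
    """
    t = torch.randint(0, len(alphas_cumprod), (1,))
    a_t = alphas_cumprod[t]
    eps = torch.randn_like(rendered)
    z_t = a_t.sqrt() * rendered + (1.0 - a_t).sqrt() * eps   # forward diffusion
    with torch.no_grad():
        eps_hat = denoiser(z_t, y, t)                        # frozen 2D prior
    w_t = 1.0 - a_t                                          # one common w(t) choice
    # Treat (eps_hat - eps) as a gradient on the render; backpropagate it
    # through the differentiable renderer to reach theta.
    return w_t * (eps_hat - eps)
```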

7. Summary

| Task | Traditional | Deep Learning | Latest |
| --- | --- | --- | --- |
| Novel view synthesis | Light fields, IBR | NeRF | 3D Gaussian Splatting |
| 3D reconstruction | SfM + MVS | Learned MVS | DUSt3R, LRM |
| Depth estimation | Stereo matching | Supervised learning | Depth Anything V2 |
| Point cloud understanding | Handcrafted features | PointNet | Point Transformer V3 |
| 3D generation | – | 3D GAN | DreamFusion, 3DGS generation |

Key Trends:

  1. From implicit to explicit representations (NeRF → 3DGS)
  2. From small-scale to large foundation models (Depth Anything, DUSt3R)
  3. Distilling 2D generative models to 3D (SDS paradigm)
  4. Real-time, editable, and interactive

References

  • Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis," ECCV 2020
  • Kerbl et al., "3D Gaussian Splatting for Real-Time Radiance Field Rendering," SIGGRAPH 2023
  • Qi et al., "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation," CVPR 2017
  • Ranftl et al., "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer," TPAMI 2022
  • Yang et al., "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data," CVPR 2024
