
3D Vision

Overview

3D vision aims to understand and reconstruct the three-dimensional world from 2D images or other sensor data. In recent years, Neural Radiance Fields (NeRF) and 3D Gaussian Splatting have revolutionized novel view synthesis, while point cloud processing and monocular depth estimation provide core capabilities for autonomous driving, robotics, and AR/VR.

```mermaid
graph TD
    A[3D Vision] --> B[Novel View Synthesis]
    A --> C[3D Representation Learning]
    A --> D[Depth Estimation]
    A --> E[3D Object Detection]

    B --> B1[NeRF]
    B --> B2[3D Gaussian Splatting]
    B --> B3[Instant-NGP]

    C --> C1[PointNet/PointNet++]
    C --> C2[Voxel Methods]
    C --> C3[Mesh Methods]

    D --> D1[Monocular Depth]
    D --> D2[Stereo Matching]
    D --> D3[Multi-View Stereo]

    E --> E1[Indoor 3D Detection]
    E --> E2[Autonomous Driving 3D Detection]
```

1. 3D Representations

1.1 Common 3D Data Representations

| Representation | Description | Pros | Cons |
| --- | --- | --- | --- |
| Point Cloud | Unordered set of 3D points \((x,y,z)\) | Lightweight, flexible | No topology |
| Voxel | 3D grid, analogous to 3D pixels | Regular structure, easy convolution | High memory \(O(n^3)\) |
| Mesh | Vertices + faces | Precise surface, rendering-friendly | Topology changes difficult |
| Implicit | Continuous function \(f(x,y,z) \to\) attributes | Arbitrary resolution | Requires sampling |
| Gaussian | Collection of 3D Gaussian ellipsoids | Fast rendering, differentiable | Large memory |

1.2 Coordinate Systems and Transformations

The camera model supplies the core mathematics linking 3D world coordinates to 2D pixel coordinates:

Pinhole Model:

\[ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K [R | t] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \]

where \(K\) is the intrinsic matrix and \([R|t]\) is the extrinsic (rotation + translation).

Intrinsic Matrix:

\[ K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \]
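
To make the projection concrete, here is a minimal NumPy sketch of the pinhole model above; the camera values (focal length 500 px, principal point at the image center of a 640×480 sensor) are illustrative assumptions, not from any particular dataset.

```python
import numpy as np

def project_points(X_world, K, R, t):
    """Project (N, 3) world points to pixel coordinates with the pinhole model."""
    X_cam = X_world @ R.T + t            # world -> camera frame via [R | t]
    uv_h = X_cam @ K.T                   # apply intrinsics: s [u, v, 1]^T
    return uv_h[:, :2] / uv_h[:, 2:3]    # perspective divide by s = Z_cam

# Illustrative 640x480 camera: f_x = f_y = 500 px, principal point at center
K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])
R, t = np.eye(3), np.zeros(3)
print(project_points(np.array([[0.1, -0.2, 2.0]]), K, R, t))  # [[345. 190.]]
```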

2. Neural Radiance Fields (NeRF)

2.1 Core Idea

NeRF (Mildenhall et al., 2020) uses an MLP to implicitly represent a 3D scene:

\[ F_\Theta: (\mathbf{x}, \mathbf{d}) \to (\mathbf{c}, \sigma) \]
  • \(\mathbf{x} = (x, y, z)\): 3D spatial position
  • \(\mathbf{d} = (\theta, \phi)\): viewing direction
  • \(\mathbf{c} = (r, g, b)\): color
  • \(\sigma\): volume density

2.2 Volume Rendering

Integrating along ray \(\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}\):

\[ C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \sigma(\mathbf{r}(t)) \mathbf{c}(\mathbf{r}(t), \mathbf{d}) \, dt \]

where the transmittance is:

\[ T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s)) \, ds\right) \]

Discrete Approximation (stratified sampling in practice):

\[ \hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i (1 - \exp(-\sigma_i \delta_i)) \mathbf{c}_i \]
\[ T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \]

where \(\delta_i = t_{i+1} - t_i\) is the distance between adjacent sample points.
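
The discrete estimator translates almost line-for-line into code. Below is a minimal PyTorch sketch; the tensor shapes (rays × samples) are assumptions chosen for clarity, and the trailing \(\delta\) is padded with a large value as is common practice.

```python
import torch

def render_rays(sigma, rgb, t_vals):
    """Discrete volume rendering of NeRF outputs along each ray.

    sigma: (n_rays, n_samples) densities, rgb: (n_rays, n_samples, 3) colors,
    t_vals: (n_rays, n_samples) sorted sample positions along the rays.
    """
    delta = t_vals[:, 1:] - t_vals[:, :-1]                        # delta_i = t_{i+1} - t_i
    delta = torch.cat([delta, torch.full_like(delta[:, :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * delta)                       # 1 - exp(-sigma_i delta_i)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)            # running transmittance
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)  # T_i
    weights = trans * alpha
    return (weights.unsqueeze(-1) * rgb).sum(dim=-2)              # \hat{C}(r): (n_rays, 3)
```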

2.3 Positional Encoding

MLPs are biased toward low-frequency functions (spectral bias) and struggle to fit high-frequency detail, so NeRF maps low-dimensional inputs to a higher-dimensional space with a positional encoding:

\[ \gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right) \]

For position \(\mathbf{x}\): \(L=10\) (60 dimensions); for direction \(\mathbf{d}\): \(L=4\) (24 dimensions).
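
A minimal PyTorch sketch of \(\gamma\); the sin/cos terms are grouped rather than interleaved as in the equation, which changes nothing downstream.

```python
import torch

def positional_encoding(p, L):
    """gamma(p): encode each coordinate with sin/cos at frequencies 2^0..2^{L-1} (times pi)."""
    freqs = 2.0 ** torch.arange(L) * torch.pi       # 2^0 pi, ..., 2^{L-1} pi
    angles = p.unsqueeze(-1) * freqs                # (..., dim, L)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(-2)                          # (..., dim * 2L)

x = torch.rand(1024, 3)
print(positional_encoding(x, 10).shape)  # torch.Size([1024, 60]) -- matches L = 10
```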

2.4 Training Pipeline

  1. Sample rays from images with known camera poses
  2. Perform stratified sampling along rays (coarse + fine)
  3. MLP predicts \((\mathbf{c}, \sigma)\) for each sample point
  4. Volume rendering produces pixel color
  5. Compute MSE loss against ground truth pixels
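
Step 2 is compact enough to sketch directly. A minimal stratified sampler, under assumed near/far bounds: each ray's \([t_n, t_f]\) range is cut into even bins and one sample is drawn uniformly inside each bin.

```python
import torch

def stratified_samples(near, far, n_rays, n_samples):
    """Cut [t_n, t_f] into even bins and draw one uniform sample per bin."""
    bins = torch.linspace(near, far, n_samples + 1)
    lower, upper = bins[:-1], bins[1:]
    u = torch.rand(n_rays, n_samples)               # independent jitter per ray and bin
    return lower + u * (upper - lower)              # (n_rays, n_samples)
```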

2.5 Limitations and Improvements

| Method | Improvement | Key Innovation |
| --- | --- | --- |
| Instant-NGP | Speed | Multi-resolution hash encoding; training drops from hours to seconds |
| Mip-NeRF | Anti-aliasing | Cone tracing replaces ray tracing; integrated positional encoding |
| Mip-NeRF 360 | Unbounded scenes | Space contraction, regularization |
| TensoRF | Efficiency | Tensor decomposition of the radiance field |
| Zip-NeRF | Combined | Merges Instant-NGP and Mip-NeRF 360 |
| NeRF in the Wild | Appearance variation | Handles lighting changes and transient objects |

3. 3D Gaussian Splatting

3.1 Core Idea

3DGS (Kerbl et al., 2023) uses millions of 3D Gaussian ellipsoids to explicitly represent scenes, achieving real-time rendering through differentiable rasterization.

Each 3D Gaussian is defined by:

  • Position \(\boldsymbol{\mu} \in \mathbb{R}^3\): Gaussian center
  • Covariance \(\boldsymbol{\Sigma} \in \mathbb{R}^{3 \times 3}\): shape and orientation
  • Opacity \(\alpha \in [0,1]\)
  • Color: Spherical harmonics (SH) coefficients

Gaussian function:

\[ G(\mathbf{x}) = \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right) \]
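
Directly optimizing a raw \(3 \times 3\) covariance would not stay positive semi-definite, so 3DGS factors it as \(\boldsymbol{\Sigma} = R S S^\top R^\top\) with a rotation (stored as a quaternion) and per-axis scales. A minimal PyTorch sketch of that parameterization:

```python
import torch

def covariance_from_params(quat, scale):
    """Sigma = R S S^T R^T from a quaternion (w, x, y, z) and per-axis scales."""
    w, x, y, z = torch.nn.functional.normalize(quat, dim=0)
    R = torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])
    M = R @ torch.diag(scale)
    return M @ M.T                                   # symmetric positive semi-definite

# An axis-aligned ellipsoid stretched along z:
Sigma = covariance_from_params(torch.tensor([1., 0., 0., 0.]),
                               torch.tensor([0.1, 0.1, 0.5]))
```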

3.2 Differentiable Rasterization

Projecting 3D Gaussians to the 2D image plane:

\[ \boldsymbol{\Sigma}' = J W \boldsymbol{\Sigma} W^\top J^\top \]

where \(W\) is the view transformation matrix and \(J\) is the Jacobian of the projection.

Alpha Compositing (front-to-back ordering):

\[ C = \sum_{i \in \mathcal{N}} \mathbf{c}_i \alpha_i \prod_{j=1}^{i-1}(1 - \alpha_j) \]
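
For the Gaussians overlapping a single pixel, the compositing sum reduces to a cumulative product over depth-sorted alphas. A minimal PyTorch sketch, with shapes assumed for one pixel:

```python
import torch

def composite_pixel(colors, alphas):
    """Front-to-back alpha compositing for one pixel.

    colors: (N, 3), alphas: (N,), both sorted front to back by Gaussian depth.
    """
    trans = torch.cumprod(1.0 - alphas, dim=0)       # prod_{j<=i} (1 - alpha_j)
    trans = torch.cat([torch.ones(1), trans[:-1]])   # shift to prod_{j<i}
    return ((alphas * trans).unsqueeze(-1) * colors).sum(dim=0)   # (3,)
```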

3.3 Adaptive Density Control

Dynamically adjusting the number of Gaussians during training:

  • Clone: large view-space gradient with small scale indicates under-reconstruction → duplicate the Gaussian
  • Split: large gradient with large scale indicates over-reconstruction → split into two smaller Gaussians
  • Prune: remove nearly transparent Gaussians (\(\alpha \approx 0\))

3.4 NeRF vs 3DGS

| Feature | NeRF | 3D Gaussian Splatting |
| --- | --- | --- |
| Representation | Implicit (MLP) | Explicit (Gaussian ellipsoids) |
| Rendering | Volume rendering (ray marching) | Rasterization (splatting) |
| Training speed | Slow (hours) | Fast (minutes) |
| Rendering speed | Slow (seconds per frame) | Real-time (>100 FPS) |
| Memory | Small (network weights) | Large (millions of Gaussians) |
| Editability | Difficult | Easy (direct Gaussian manipulation) |
| Anti-aliasing | Requires Mip-NeRF | Naturally smooth |

4. Point Cloud Processing

4.1 PointNet

Core Challenge: Point clouds are unordered sets; the network must be permutation invariant.

PointNet (Qi et al., 2017) solution:

\[ f(\{x_1, \ldots, x_n\}) = g(h(x_1), \ldots, h(x_n)) \]
  • \(h\): per-point MLP (shared weights)
  • \(g\): symmetric function (max pooling)

Architecture:

  1. Input transform (T-Net): learn \(3 \times 3\) transformation matrix
  2. Per-point feature extraction: MLP \((3 \to 64 \to 128 \to 1024)\)
  3. Global feature: Max Pooling
  4. Classification/segmentation heads
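
Putting the formula and architecture together, a minimal PointNet classifier is just a shared per-point MLP followed by max pooling. The sketch below omits the T-Nets for brevity, and `n_classes=40` is an assumption (ModelNet40-style):

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet classifier: shared per-point MLP h, symmetric max-pool g."""

    def __init__(self, n_classes=40):
        super().__init__()
        self.h = nn.Sequential(                      # h: weights shared across points
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 1024),
        )
        self.head = nn.Linear(1024, n_classes)

    def forward(self, pts):                          # pts: (batch, n_points, 3)
        feats = self.h(pts)                          # (batch, n_points, 1024)
        global_feat = feats.max(dim=1).values        # g: order-independent pooling
        return self.head(global_feat)

logits = TinyPointNet()(torch.rand(8, 2048, 3))      # permuting the 2048 points
                                                     # leaves the logits unchanged
```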

4.2 PointNet++

Improvement: PointNet lacks local structure information.

Hierarchical Point Set Learning:

  1. Sampling Layer: Farthest Point Sampling (FPS) selects centroids
  2. Grouping Layer: Ball Query finds neighborhood points
  3. Feature Extraction: Apply PointNet to each local region
\[ \text{PointNet++} = \text{FPS} + \text{Ball Query} + \text{PointNet}(\text{local}) \]
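
A minimal sketch of the FPS step, which greedily picks the point farthest from all centroids chosen so far (the O(Nk) loop is the textbook version, not an optimized implementation):

```python
import torch

def farthest_point_sampling(pts, k):
    """Greedy FPS over an (N, 3) point cloud; returns indices of k centroids."""
    n = pts.shape[0]
    idx = torch.zeros(k, dtype=torch.long)
    min_dist = torch.full((n,), float("inf"))
    idx[0] = torch.randint(0, n, (1,))               # arbitrary seed point
    for i in range(1, k):
        d = ((pts - pts[idx[i - 1]]) ** 2).sum(dim=-1)
        min_dist = torch.minimum(min_dist, d)        # distance to nearest chosen centroid
        idx[i] = min_dist.argmax()                   # pick the farthest point
    return idx
```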

4.3 Subsequent Developments

| Method | Innovation |
| --- | --- |
| DGCNN | Dynamic graph convolution, EdgeConv |
| Point Transformer | Self-attention for point clouds |
| PCT | Point Cloud Transformer |
| PointNeXt | Revisiting the importance of training strategies |

5. Monocular Depth Estimation

5.1 Problem Definition

Predicting per-pixel depth from a single RGB image is an ill-posed problem due to scale ambiguity.

5.2 MiDaS

MiDaS (Ranftl et al., 2020):

  • Mixed training on multiple datasets
  • Uses scale- and shift-invariant loss:
\[ \mathcal{L} = \frac{1}{M} \sum_{i=1}^{M} \rho\left(\hat{d}_i^* - d_i^*\right) \]

where \(\hat{d}^*\) and \(d^*\) are the normalized predicted and ground truth depths respectively.
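
One common instantiation of the alignment is MiDaS's median/MAD normalization with \(\rho = |\cdot|\) (the paper also uses a trimmed variant); a minimal PyTorch sketch:

```python
import torch

def ssi_normalize(d):
    """Remove scale and shift: subtract the median, divide by mean absolute deviation."""
    t = d.median()
    s = (d - t).abs().mean().clamp(min=1e-8)
    return (d - t) / s

def ssi_loss(pred, target):
    """Scale- and shift-invariant loss with rho = |.|."""
    return (ssi_normalize(pred) - ssi_normalize(target)).abs().mean()
```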

5.3 Depth Anything

Depth Anything (Yang et al., 2024):

  • Strong visual encoder based on DINOv2
  • Self-training on large-scale unlabeled data
  • V2 uses synthetic data for metric depth

Key Design:

  1. Train teacher model on labeled data
  2. Teacher generates pseudo-labels for 62M unlabeled images
  3. Student trains on labeled + pseudo-labeled data
  4. Strong data augmentation prevents student shortcuts
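
Schematically, one pseudo-labeling step (steps 2–4) looks as follows; `teacher`, `student`, and `augment` are hypothetical stand-ins for the paper's DINOv2-based models and strong perturbations:

```python
import torch

def self_training_step(teacher, student, optimizer, unlabeled_imgs, augment):
    """One pseudo-labeling step.

    `teacher` and `student` are depth networks (image -> depth map) and
    `augment` is a strong, label-preserving perturbation (e.g. color jitter);
    all three are hypothetical stand-ins for the paper's components.
    """
    with torch.no_grad():
        pseudo_depth = teacher(unlabeled_imgs)       # teacher labels clean images
    pred = student(augment(unlabeled_imgs))          # student sees hard views
    loss = (pred - pseudo_depth).abs().mean()        # simple L1 to pseudo-labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```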

5.4 Metric Depth vs Relative Depth

| Type | Output | Applications | Representative Methods |
| --- | --- | --- | --- |
| Relative depth | Ordinal relations | Image editing, refocusing | MiDaS, Depth Anything |
| Metric depth | Absolute metric depth | Robotics, autonomous driving | ZoeDepth, Metric3D |

6. Frontiers in 3D Vision

6.1 Large-Scale 3D Reconstruction

  • DUSt3R / MASt3R: 3D reconstruction without camera poses
  • LRM: Large Reconstruction Model, single image to 3D
  • One-2-3-45++: Single image to 3D mesh

6.2 4D Scenes (Dynamic 3D)

  • Dynamic 3DGS: Gaussian splatting with temporal dimension
  • 4D Gaussian Splatting: Spatiotemporal Gaussian representation

6.3 Generative 3D

  • Zero-1-to-3: Zero-shot novel view synthesis
  • DreamFusion: Text-to-3D via Score Distillation Sampling (SDS), whose gradient is sketched after this list:
\[ \nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t,\epsilon}\left[w(t)\left(\hat{\epsilon}_\phi(z_t; y, t) - \epsilon\right) \frac{\partial g(\theta)}{\partial \theta}\right] \]
  • Magic3D: Two-stage text-to-3D (coarse-to-fine)
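
A schematic SDS step, under assumed DDPM-style conventions: noise the rendered image, query a frozen 2D denoiser, and use the residual as a gradient on the render. `denoiser` and `alphas_cumprod` are hypothetical stand-ins, not DreamFusion's actual interfaces:

```python
import torch

def sds_grad(rendered, denoiser, y, alphas_cumprod):
    """Schematic SDS update direction for a rendered image g(theta).

    `denoiser(z_t, y, t) -> eps_hat` is a frozen text-conditioned diffusion
    model and `alphas_cumprod` a DDPM noise schedule -- both hypothetical.
    """
    t = torch.randint(0, len(alphas_cumprod), (1,))
    a_t = alphas_cumprod[t]
    eps = torch.randn_like(rendered)
    z_t = a_t.sqrt() * rendered + (1.0 - a_t).sqrt() * eps   # forward diffusion
    with torch.no_grad():
        eps_hat = denoiser(z_t, y, t)                        # frozen 2D prior
    w_t = 1.0 - a_t                                          # one common w(t) choice
    # Treat (eps_hat - eps) as a gradient on the render; backpropagate it
    # through the differentiable renderer to reach theta.
    return w_t * (eps_hat - eps)
```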

7. Summary

| Task | Traditional | Deep Learning | Latest |
| --- | --- | --- | --- |
| Novel view synthesis | Light fields, IBR | NeRF | 3D Gaussian Splatting |
| 3D reconstruction | SfM + MVS | Learned MVS | DUSt3R, LRM |
| Depth estimation | Stereo matching | Supervised learning | Depth Anything V2 |
| Point cloud understanding | Handcrafted features | PointNet | Point Transformer V3 |
| 3D generation | – | 3D GAN | DreamFusion, 3DGS generation |

Key Trends:

  1. From implicit to explicit representations (NeRF → 3DGS)
  2. From small-scale to large foundation models (Depth Anything, DUSt3R)
  3. Distilling 2D generative models to 3D (SDS paradigm)
  4. Real-time, editable, and interactive

References

  • Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis," ECCV 2020
  • Kerbl et al., "3D Gaussian Splatting for Real-Time Radiance Field Rendering," SIGGRAPH 2023
  • Qi et al., "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation," CVPR 2017
  • Ranftl et al., "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer," TPAMI 2022
  • Yang et al., "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data," CVPR 2024
