3D Vision
Overview
3D vision aims to understand and reconstruct the three-dimensional world from 2D images or other sensor data. In recent years, Neural Radiance Fields (NeRF) and 3D Gaussian Splatting have revolutionized novel view synthesis, while point cloud processing and monocular depth estimation provide core capabilities for autonomous driving, robotics, and AR/VR.
```mermaid
graph TD
    A[3D Vision] --> B[Novel View Synthesis]
    A --> C[3D Representation Learning]
    A --> D[Depth Estimation]
    A --> E[3D Object Detection]
    B --> B1[NeRF]
    B --> B2[3D Gaussian Splatting]
    B --> B3[Instant-NGP]
    C --> C1[PointNet/PointNet++]
    C --> C2[Voxel Methods]
    C --> C3[Mesh Methods]
    D --> D1[Monocular Depth]
    D --> D2[Stereo Matching]
    D --> D3[Multi-View Stereo]
    E --> E1[Indoor 3D Detection]
    E --> E2[Autonomous Driving 3D Detection]
```
1. 3D Representations
1.1 Common 3D Data Representations
| Representation | Description | Pros | Cons |
|---|---|---|---|
| Point Cloud | Unordered set of 3D points \((x,y,z)\) | Lightweight, flexible | No topology |
| Voxel | 3D grid, analogous to 3D pixels | Regular structure, easy convolution | High memory \(O(n^3)\) |
| Mesh | Vertices + faces | Precise surface, rendering-friendly | Topology changes difficult |
| Implicit | Continuous function \(f(x,y,z) \to\) attributes | Arbitrary resolution | Requires sampling |
| Gaussian | Collection of 3D Gaussian ellipsoids | Fast rendering, differentiable | Large memory |
1.2 Coordinate Systems and Transformations
The core mathematics of camera models:

Pinhole Model:

\[ \lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K [R \mid t] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \]

where \(K\) is the intrinsic matrix, \([R|t]\) is the extrinsic (rotation + translation), and \(\lambda\) is the projective depth.

Intrinsic Matrix:

\[ K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \]

where \(f_x, f_y\) are the focal lengths in pixels and \((c_x, c_y)\) is the principal point.
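As a quick check of these formulas, here is a minimal NumPy sketch of pinhole projection; the intrinsics and points are illustrative values, not from any particular dataset:

```python
import numpy as np

def project_points(X_world, K, R, t):
    """Project Nx3 world points to pixel coordinates with a pinhole camera."""
    X_cam = X_world @ R.T + t          # world -> camera frame: R X + t
    uvw = X_cam @ K.T                  # apply intrinsics K
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide by lambda -> (u, v)

# Illustrative intrinsics: 800px focal length, principal point at image center
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
R, t = np.eye(3), np.zeros(3)
pts = np.array([[0.1, -0.2, 2.0], [0.0, 0.0, 5.0]])
print(project_points(pts, K, R, t))
```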
2. Neural Radiance Fields (NeRF)
2.1 Core Idea
NeRF (Mildenhall et al., 2020) uses an MLP \(F_\Theta\) to implicitly represent a 3D scene as the mapping

\[ F_\Theta: (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma) \]

where:
- \(\mathbf{x} = (x, y, z)\): 3D spatial position
- \(\mathbf{d} = (\theta, \phi)\): viewing direction
- \(\mathbf{c} = (r, g, b)\): color
- \(\sigma\): volume density
2.2 Volume Rendering
Integrating along ray \(\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}\):

\[ C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt \]

where the transmittance is:

\[ T(t) = \exp\!\left( -\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds \right) \]

Discrete Approximation (stratified sampling in practice):

\[ \hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left( 1 - e^{-\sigma_i \delta_i} \right) \mathbf{c}_i, \qquad T_i = \exp\!\left( -\sum_{j=1}^{i-1} \sigma_j \delta_j \right) \]

where \(\delta_i = t_{i+1} - t_i\) is the distance between adjacent sample points.
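The discrete sum above maps directly to a few lines of NumPy. A minimal sketch, assuming `sigma`, `rgb`, and `t_vals` hold one ray's samples (names are illustrative):

```python
import numpy as np

def render_ray(sigma, rgb, t_vals):
    """Composite N per-sample densities/colors along one ray into a pixel color."""
    delta = np.diff(t_vals)                       # delta_i = t_{i+1} - t_i
    alpha = 1.0 - np.exp(-sigma * delta)          # 1 - e^{-sigma_i delta_i}
    accum = np.concatenate([[0.0], np.cumsum(sigma * delta)])[:-1]
    T = np.exp(-accum)                            # transmittance T_i
    weights = T * alpha
    return (weights[:, None] * rgb).sum(axis=0)   # \hat{C}(r)

# Toy ray with 64 samples
t_vals = np.linspace(2.0, 6.0, 65)
print(render_ray(np.random.rand(64), np.random.rand(64, 3), t_vals))
```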
2.3 Positional Encoding
MLPs struggle to learn high-frequency details, so positional encoding maps low-dimensional inputs to a higher-dimensional space:

\[ \gamma(p) = \left( \sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p) \right) \]

applied to each coordinate separately. For position \(\mathbf{x}\): \(L=10\) (60 dimensions); for direction \(\mathbf{d}\): \(L=4\) (24 dimensions).
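A small NumPy version of \(\gamma\), applied independently to each input coordinate (function name and shapes are just for illustration):

```python
import numpy as np

def positional_encoding(p, L):
    """Map (..., D) inputs to (..., D*2*L) sinusoidal features."""
    freqs = (2.0 ** np.arange(L)) * np.pi           # 2^0 pi ... 2^{L-1} pi
    angles = p[..., None] * freqs                   # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)

x = np.array([[0.1, 0.5, -0.3]])
print(positional_encoding(x, L=10).shape)  # (1, 60), matching L=10 for positions
```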
2.4 Training Pipeline
- Sample rays from images with known camera poses
- Perform stratified sampling along rays (coarse + fine; the stratified step is sketched after this list)
- MLP predicts \((\mathbf{c}, \sigma)\) for each sample point
- Volume rendering produces pixel color
- Compute MSE loss against ground truth pixels
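One plausible implementation of the stratified sampling step (the coarse pass), assuming scalar near/far bounds per ray:

```python
import numpy as np

def stratified_samples(t_near, t_far, n_samples, rng=None):
    """Draw one jittered sample per evenly spaced bin in [t_near, t_far]."""
    rng = np.random.default_rng() if rng is None else rng
    edges = np.linspace(t_near, t_far, n_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    return lower + (upper - lower) * rng.random(n_samples)

print(stratified_samples(2.0, 6.0, 64)[:4])
```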
2.5 Limitations and Improvements
| Method | Improvement | Key Innovation |
|---|---|---|
| Instant-NGP | Speed | Multi-resolution hash encoding, training from hours to seconds |
| Mip-NeRF | Anti-aliasing | Casts cones instead of rays, integrated positional encoding (IPE) |
| Mip-NeRF 360 | Unbounded scenes | Space contraction, regularization |
| TensoRF | Efficiency | Tensor decomposition for radiance fields |
| Zip-NeRF | Speed + quality | Combines Instant-NGP's hash grids with Mip-NeRF 360's anti-aliasing |
| NeRF in the Wild | Appearance variation | Handles lighting and transient objects |
3. 3D Gaussian Splatting
3.1 Core Idea
3DGS (Kerbl et al., 2023) uses millions of 3D Gaussian ellipsoids to explicitly represent scenes, achieving real-time rendering through differentiable rasterization.
Each 3D Gaussian is defined by:
- Position \(\boldsymbol{\mu} \in \mathbb{R}^3\): Gaussian center
- Covariance \(\boldsymbol{\Sigma} \in \mathbb{R}^{3 \times 3}\): shape and orientation
- Opacity \(\alpha \in [0,1]\)
- Color: Spherical harmonics (SH) coefficients
Gaussian function:

\[ G(\mathbf{x}) = \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) \]

To keep \(\boldsymbol{\Sigma}\) positive semi-definite during optimization, it is factored as \(\boldsymbol{\Sigma} = R S S^T R^T\) with a diagonal scale matrix \(S\) and a rotation \(R\) (stored as a quaternion).
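A NumPy sketch of assembling \(\boldsymbol{\Sigma}\) from the stored parameters; the \((w, x, y, z)\) quaternion layout is an assumption:

```python
import numpy as np

def covariance_from_params(quat, scale):
    """Sigma = R S S^T R^T from a quaternion (w, x, y, z) and three scales."""
    w, x, y, z = quat / np.linalg.norm(quat)       # normalize to a unit quaternion
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    M = R @ np.diag(scale)                         # R S
    return M @ M.T                                 # symmetric positive semi-definite

print(covariance_from_params(np.array([1.0, 0.0, 0.0, 0.3]), np.array([0.1, 0.2, 0.05])))
```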
3.2 Differentiable Rasterization
Projecting a 3D Gaussian to the 2D image plane transforms its covariance as:

\[ \boldsymbol{\Sigma}' = J W \boldsymbol{\Sigma} W^T J^T \]
where \(W\) is the view transformation matrix and \(J\) is the Jacobian of the projection.
Alpha Compositing (front-to-back ordering):

\[ C = \sum_{i=1}^{N} \mathbf{c}_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j) \]

where \(\alpha_i\) combines the Gaussian's learned opacity with its projected 2D falloff at the pixel.
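Per pixel, the compositing loop looks roughly like this, assuming the contributions are already depth-sorted; the early-exit threshold is illustrative:

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing of depth-sorted contributions at one pixel."""
    C = np.zeros(3)
    T = 1.0                        # accumulated transmittance prod(1 - alpha_j)
    for c, a in zip(colors, alphas):
        C += T * a * np.asarray(c)
        T *= 1.0 - a
        if T < 1e-4:               # early termination once nearly opaque
            break
    return C

print(composite([(1, 0, 0), (0, 1, 0)], [0.6, 0.5]))
```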
3.3 Adaptive Density Control
Dynamically adjusting the number of Gaussians during training (a sketch of the control loop follows this list):
- Clone: Large gradient but small scale → under-reconstructed regions
- Split: Large gradient and large scale → over-reconstructed regions
- Prune: Remove nearly transparent (\(\alpha \approx 0\)) Gaussians
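A hedged sketch of this control loop, not the reference implementation: the `Gaussian` stand-in, its fields, and the threshold values are illustrative (only the split factor 1.6 follows the paper):

```python
from dataclasses import dataclass, replace

@dataclass
class Gaussian:          # minimal stand-in for one Gaussian's relevant state
    opacity: float
    max_scale: float
    view_grad: float     # accumulated screen-space positional gradient

def densify_and_prune(gaussians, grad_thresh=2e-4, scale_thresh=0.01, alpha_min=0.005):
    out = []
    for g in gaussians:
        if g.opacity < alpha_min:
            continue                                   # prune near-transparent
        if g.view_grad > grad_thresh and g.max_scale < scale_thresh:
            out += [g, replace(g)]                     # clone: under-reconstructed
        elif g.view_grad > grad_thresh:
            child = replace(g, max_scale=g.max_scale / 1.6)
            out += [child, replace(child)]             # split: over-reconstructed
        else:
            out.append(g)
    return out
```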
3.4 NeRF vs 3DGS
| Feature | NeRF | 3D Gaussian Splatting |
|---|---|---|
| Representation | Implicit (MLP) | Explicit (Gaussian ellipsoids) |
| Rendering | Volume rendering (ray marching) | Rasterization (splatting) |
| Training speed | Slow (hours) | Fast (minutes) |
| Rendering speed | Slow (seconds per frame) | Real-time (>100 FPS) |
| Memory | Small (network weights) | Large (millions of Gaussians) |
| Editability | Difficult | Easy (direct Gaussian manipulation) |
| Anti-aliasing | Requires Mip-NeRF | Naturally smooth |
4. Point Cloud Processing
4.1 PointNet
Core Challenge: Point clouds are unordered sets; the network must be permutation invariant.
PointNet (Qi et al., 2017) approximates a set function with a symmetric composition:

\[ f(\{x_1, \ldots, x_n\}) \approx g\big( h(x_1), \ldots, h(x_n) \big) \]

- \(h\): per-point MLP (shared weights)
- \(g\): symmetric function (max pooling), which makes the output independent of point order
Architecture (a minimal sketch follows this list):
- Input transform (T-Net): learn \(3 \times 3\) transformation matrix
- Per-point feature extraction: MLP \((3 \to 64 \to 128 \to 1024)\)
- Global feature: Max Pooling
- Classification/segmentation heads
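A stripped-down PyTorch sketch of this pipeline (T-Nets omitted for brevity; layer sizes follow the list above):

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet classifier for (B, N, 3) point clouds."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.point_mlp = nn.Sequential(        # shared per-point MLP h
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 1024), nn.ReLU(),
        )
        self.head = nn.Linear(1024, num_classes)

    def forward(self, pts):
        feats = self.point_mlp(pts)            # (B, N, 1024)
        global_feat = feats.max(dim=1).values  # symmetric function g: max pool over N
        return self.head(global_feat)          # (B, num_classes)

logits = TinyPointNet()(torch.randn(2, 1024, 3))  # order of the 1024 points is irrelevant
```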
4.2 PointNet++
Motivation: PointNet pools a single global feature and therefore misses local geometric structure.
Hierarchical Point Set Learning:
- Sampling Layer: Farthest Point Sampling (FPS) selects centroids (sketched after this list)
- Grouping Layer: Ball Query finds neighborhood points
- Feature Extraction: Apply PointNet to each local region
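A minimal NumPy version of the FPS sampling layer:

```python
import numpy as np

def farthest_point_sampling(pts, k, rng=None):
    """Greedily pick k well-spread centroids from an (N, 3) point cloud."""
    rng = np.random.default_rng() if rng is None else rng
    n = pts.shape[0]
    chosen = [int(rng.integers(n))]                 # random seed point
    dist = np.full(n, np.inf)                       # min distance to chosen set
    for _ in range(k - 1):
        dist = np.minimum(dist, np.linalg.norm(pts - pts[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))           # farthest remaining point
    return pts[chosen]

print(farthest_point_sampling(np.random.rand(1000, 3), k=16).shape)  # (16, 3)
```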
4.3 Subsequent Developments
| Method | Innovation |
|---|---|
| DGCNN | Dynamic graph convolution, EdgeConv |
| Point Transformer | Self-attention for point clouds |
| PCT | Point Cloud Transformer |
| PointNeXt | Revisiting the importance of training strategies |
5. Monocular Depth Estimation
5.1 Problem Definition
Predicting per-pixel depth from a single RGB image is an ill-posed problem due to scale ambiguity.
5.2 MiDaS
MiDaS (Ranftl et al., 2020):
- Mixed training on multiple datasets
- Uses a scale- and shift-invariant loss (see the sketch below):

\[ \mathcal{L}_{ssi} = \frac{1}{2M} \sum_{i=1}^{M} \rho\left( \hat{d}_i^* - d_i^* \right), \qquad d^* = \frac{d - \mathrm{median}(d)}{\frac{1}{M} \sum_{i=1}^{M} \left| d_i - \mathrm{median}(d) \right|} \]

where \(\hat{d}^*\) and \(d^*\) are the normalized predicted and ground-truth depths respectively, and \(\rho\) is a robust penalty (e.g. absolute difference).
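A compact PyTorch sketch of this loss with the median/MAD normalization above; `ssi_loss`, the flattened `(B, H*W)` layout, and plain absolute difference for \(\rho\) are assumptions:

```python
import torch

def ssi_loss(pred, target, eps=1e-6):
    """Scale- and shift-invariant depth loss on (B, H*W) depth maps."""
    def normalize(d):
        t = d.median(dim=1, keepdim=True).values          # shift: median
        s = (d - t).abs().mean(dim=1, keepdim=True)       # scale: mean abs deviation
        return (d - t) / (s + eps)
    return (normalize(pred) - normalize(target)).abs().mean()

loss = ssi_loss(torch.rand(4, 100 * 100), torch.rand(4, 100 * 100))
```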
5.3 Depth Anything
Depth Anything (Yang et al., 2024):
- Strong visual encoder based on DINOv2
- Self-training on large-scale unlabeled data
- V2 replaces labeled real images with precisely annotated synthetic data
Key Design:
- Train teacher model on labeled data
- Teacher generates pseudo-labels for 62M unlabeled images
- Student trains on labeled + pseudo-labeled data
- Strong data augmentation prevents student shortcuts
5.4 Metric Depth vs Relative Depth
| Type | Output | Applications | Representative Methods |
|---|---|---|---|
| Relative Depth | Depth up to scale and shift | Image editing, refocusing | MiDaS, Depth Anything |
| Metric Depth | Absolute metric depth | Robotics, autonomous driving | ZoeDepth, Metric3D |
6. Frontiers in 3D Vision
6.1 Large-Scale 3D Reconstruction
- DUSt3R / MASt3R: 3D reconstruction without camera poses
- LRM: Large Reconstruction Model, single image to 3D
- One-2-3-45++: Single image to 3D mesh
6.2 4D Scenes (Dynamic 3D)
- Dynamic 3DGS: Gaussian splatting with temporal dimension
- 4D Gaussian Splatting: Spatiotemporal Gaussian representation
6.3 Generative 3D
- Zero-1-to-3: Zero-shot novel view synthesis
- DreamFusion: Text-to-3D via Score Distillation Sampling (SDS)
- Magic3D: Two-stage text-to-3D (coarse-to-fine)
7. Summary
| Task | Traditional | Deep Learning | Latest |
|---|---|---|---|
| Novel View Synthesis | Light fields, IBR | NeRF | 3D Gaussian Splatting |
| 3D Reconstruction | SfM + MVS | Learned MVS | DUSt3R, LRM |
| Depth Estimation | Stereo matching | Supervised learning | Depth Anything V2 |
| Point Cloud Understanding | Handcrafted features | PointNet | Point Transformer V3 |
| 3D Generation | - | 3D GAN | DreamFusion, 3DGS generation |
Key Trends:
- From implicit to explicit representations (NeRF → 3DGS)
- From small-scale to large foundation models (Depth Anything, DUSt3R)
- Distilling 2D generative models to 3D (SDS paradigm)
- Real-time, editable, and interactive
References
- Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis," ECCV 2020
- Kerbl et al., "3D Gaussian Splatting for Real-Time Radiance Field Rendering," SIGGRAPH 2023
- Qi et al., "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation," CVPR 2017
- Ranftl et al., "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer," TPAMI 2020
- Yang et al., "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data," CVPR 2024