Stereo Vision and Structured Light
Overview
Stereo vision and structured light are two major depth sensing technologies. Stereo vision mimics human binocular vision to compute depth through disparity, while structured light measures depth by projecting known patterns and observing their deformation. This section provides an in-depth look at the mathematical principles, engineering implementations, and performance comparisons.
Stereo Vision Principles
Basic Geometry
A stereo camera consists of two horizontally aligned cameras that compute depth through triangulation:
```
  Left camera O_L           Right camera O_R
       *---------- B ----------*
       |\                     /|
       | \                   / |
     f |  \                 /  | f
  -----*---+---------------+---*-----  image plane
          x_L             x_R
            \               /
             \             /
              \           /
               \         /
                \       /
                 \     /
                  \   /
                   \ /
                    *  P(X, Y, Z)
```
Disparity and Depth
Disparity is the horizontal position difference of the same point in the left and right images:

\[ d = x_L - x_R \]

From similar triangles, \(x_L = \frac{fX}{Z}\) and \(x_R = \frac{f(X - B)}{Z}\), so:

\[ d = x_L - x_R = \frac{fB}{Z} \]

Therefore depth is:

\[ Z = \frac{fB}{d} \]
Where:
- \(Z\): Depth (meters)
- \(f\): Focal length (pixels)
- \(B\): Baseline length (meters)
- \(d\): Disparity (pixels)
Depth Resolution
Minimum depth resolution (the depth difference corresponding to adjacent disparity values) follows from differentiating \(Z = fB/d\):

\[ \Delta Z = \left|\frac{\partial Z}{\partial d}\right| \Delta d = \frac{Z^2}{fB}\,\Delta d \]

When \(\Delta d = 1\) (integer-pixel disparity), the depth step \(\Delta Z\) grows with the square of distance.
Depth Resolution Calculation
Parameters: \(f = 400\text{ px}\), \(B = 0.12\text{ m}\) (baseline comparable to a ZED 2i)
- At 1m: \(\Delta Z = \frac{1^2}{400 \times 0.12} = 0.021\text{m} = 2.1\text{cm}\)
- At 5m: \(\Delta Z = \frac{25}{48} = 0.52\text{m} = 52\text{cm}\)
Sub-pixel disparity (0.1px accuracy) can improve resolution by 10x.
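A quick numeric check of these figures, using the same assumed \(f\) and \(B\) (a minimal sketch, not tied to any particular camera):

```python
f, B = 400.0, 0.12  # focal length (px) and baseline (m), as assumed above

for Z in [1.0, 2.0, 5.0, 10.0]:
    dZ = Z**2 / (f * B)  # depth step for 1 px of disparity
    print(f"Z = {Z:4.1f} m: dZ = {100 * dZ:6.1f} cm per px, "
          f"{100 * dZ * 0.1:5.2f} cm at 0.1 px")
```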
Depth Error Analysis
Propagating a disparity error \(\sigma_d\) through \(Z = fB/d\) gives the depth error:

\[ \sigma_Z = \frac{Z^2}{fB}\,\sigma_d \]

This suggests three paths to improve depth accuracy (see the sketch after this list):
- Increase baseline \(B\): But the near-range blind zone also increases (\(Z_{\min} \approx \frac{fB}{d_{\max}}\))
- Increase focal length \(f\): But FOV decreases
- Reduce disparity error \(\sigma_d\): Better matching algorithms, sub-pixel interpolation
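A minimal numeric sketch of the baseline trade-off, with illustrative parameter values: widening \(B\) improves accuracy at range but pushes the near blind zone outward.

```python
f = 400.0      # focal length in pixels (assumed)
d_max = 128.0  # maximum disparity searched (e.g. SGBM numDisparities)
sigma_d = 0.1  # disparity error in pixels (assumed sub-pixel matching)
Z = 5.0        # working distance in meters

for B in [0.06, 0.12, 0.24]:
    Z_min = f * B / d_max               # near-range blind zone
    sigma_Z = Z**2 / (f * B) * sigma_d  # depth error at distance Z
    print(f"B = {B:.2f} m: Z_min = {Z_min:.2f} m, sigma_Z @ 5 m = {100 * sigma_Z:.1f} cm")
```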
Epipolar Geometry
Epipolar Constraint
For a calibrated stereo system, a point \(\mathbf{x}_L\) in the left image must lie on the corresponding epipolar line in the right image:

\[ \mathbf{x}_R^\top \mathbf{F}\,\mathbf{x}_L = 0 \]

Where \(\mathbf{F}\) is the fundamental matrix and \(\mathbf{x}_L, \mathbf{x}_R\) are homogeneous pixel coordinates.
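A minimal sketch of checking this constraint numerically; `pts_left` and `pts_right` are assumed to be given N x 2 arrays of matched pixel coordinates (e.g. from feature matching):

```python
import cv2
import numpy as np

# pts_left, pts_right: N x 2 float arrays of matched pixel coordinates (assumed given)
F, inlier_mask = cv2.findFundamentalMat(pts_left, pts_right, cv2.FM_RANSAC)

# Evaluate x_R^T F x_L for each correspondence; inliers should be near zero
ones = np.ones((len(pts_left), 1))
xL = np.hstack([pts_left, ones])  # homogeneous coordinates
xR = np.hstack([pts_right, ones])
residuals = np.abs(np.sum(xR * (F @ xL.T).T, axis=1))
print("median residual (inliers):", np.median(residuals[inlier_mask.ravel() == 1]))
```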
For rectified stereo, epipolar lines become horizontal, reducing the search from 2D to 1D:
```
Left image:                      Right image:
+--------------------+          +--------------------+
|       *P_L         |          |    *P_R            |
|---- epipolar line--|   -->    |---- epipolar line--|
|                    |          |                    |
+--------------------+          +--------------------+
```
P_L and P_R lie on the same row (same y coordinate).
Rectification
Stereo rectification makes the two camera image planes coplanar and row-aligned:
```python
import cv2
import numpy as np

# Known calibration parameters (fx, fy, cx, cy, k1..k3, p1, p2 come from a
# prior calibration, e.g. cv2.stereoCalibrate)
K_left = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
K_right = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
D_left = np.array([k1, k2, p1, p2, k3])
D_right = np.array([k1, k2, p1, p2, k3])
R = rotation_matrix_3x3  # rotation of the right camera relative to the left
T = translation_vector   # translation of the right camera relative to the left

# Stereo rectification: compute rotations R1/R2 and projections P1/P2 that
# make both image planes coplanar and row-aligned
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
    K_left, D_left, K_right, D_right,
    image_size, R, T,
    flags=cv2.CALIB_ZERO_DISPARITY,
    alpha=0  # crop to valid pixels only
)

# Generate per-pixel remapping tables
map1_left, map2_left = cv2.initUndistortRectifyMap(
    K_left, D_left, R1, P1, image_size, cv2.CV_16SC2
)
map1_right, map2_right = cv2.initUndistortRectifyMap(
    K_right, D_right, R2, P2, image_size, cv2.CV_16SC2
)

# Apply rectification
left_rectified = cv2.remap(left_img, map1_left, map2_left, cv2.INTER_LINEAR)
right_rectified = cv2.remap(right_img, map1_right, map2_right, cv2.INTER_LINEAR)
```
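A common sanity check (a convention, not part of the pipeline above): after rectification, corresponding features must share a row, so horizontal lines drawn across the stitched pair should cross the same scene points in both halves.

```python
import cv2
import numpy as np

# Continues the rectification example above
vis = np.hstack([left_rectified, right_rectified])
for y in range(0, vis.shape[0], 40):
    cv2.line(vis, (0, y), (vis.shape[1] - 1, y), (0, 255, 0), 1)
cv2.imwrite('rectification_check.png', vis)
```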
Disparity Matching Algorithms
Semi-Global Block Matching (SGBM)
```python
# OpenCV Semi-Global Block Matching
stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,     # must be a multiple of 16
    blockSize=5,            # matching window size
    P1=8 * 3 * 5**2,        # penalty for disparity changes of +/-1 (smoothness)
    P2=32 * 3 * 5**2,       # penalty for larger disparity jumps (P2 > P1)
    disp12MaxDiff=1,        # left-right consistency check threshold
    uniquenessRatio=10,     # uniqueness constraint
    speckleWindowSize=100,  # speckle removal window
    speckleRange=32
)

# Compute disparity map (SGBM returns fixed-point values scaled by 16)
disparity = stereo.compute(left_rectified, right_rectified)
disparity_float = disparity.astype(np.float32) / 16.0

# Disparity to depth: Z = f * B / d (invalid where d <= 0)
depth = (focal_length * baseline) / (disparity_float + 1e-6)
depth[disparity_float <= 0] = 0
```
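The Q matrix from cv2.stereoRectify earlier encodes the same \(Z = fB/d\) relation, so the disparity map can also be reprojected to a point cloud in one call:

```python
# Continues the SGBM example; Q comes from cv2.stereoRectify
points_3d = cv2.reprojectImageTo3D(disparity_float, Q)  # H x W x 3 array

valid = disparity_float > 0     # mask out invalid disparities
xyz = points_3d[valid]          # N x 3 camera-frame points
colors = left_rectified[valid]  # matching pixel values, e.g. for visualization
```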
Matching Algorithm Comparison
| Algorithm | Accuracy | Speed | Features |
|---|---|---|---|
| BM (Block Matching) | Low | Fastest (~100fps) | Simple local matching |
| SGBM (Semi-Global) | Medium | Medium (~30fps) | Multi-direction cost aggregation |
| ELAS | Medium-High | Medium (~20fps) | Adaptive support points |
| AANet (Deep Learning) | High | Medium (~20fps on GPU) | End-to-end learning |
| RAFT-Stereo (Deep Learning) | Highest | Slow (~5fps on GPU) | Iterative optical flow |
Deep Learning Stereo Matching
```python
import torch
from raft_stereo import RAFTStereo

# Load pretrained model (interface per the RAFT-Stereo reference implementation)
model = RAFTStereo(args)
model.load_state_dict(torch.load('raft-stereo.pth'))
model.cuda()
model.eval()

with torch.no_grad():
    # Input left and right images as 1 x 3 x H x W float tensors
    left = torch.from_numpy(left_img).permute(2, 0, 1).float().unsqueeze(0).cuda()
    right = torch.from_numpy(right_img).permute(2, 0, 1).float().unsqueeze(0).cuda()

    # Predict disparity with 20 refinement iterations; test_mode returns the
    # (low-res, full-res) predictions in the reference implementation
    _, disparity = model(left, right, iters=20, test_mode=True)
    disparity = disparity.squeeze().cpu().numpy()
```
Structured Light
Principle
Structured light projects known light patterns onto a scene and computes depth by observing pattern deformation:
```
  Projector                Camera
      *                       *
      |\                     /|
      | \  known            / |
      |  \ pattern         /  |
      |   \  |||          /   |
      |    \ |||         /    |
   ===========Object===========
```
The pattern's deformation reflects the surface shape.
Speckle Pattern
RealSense D435i uses infrared speckle projection:
- Projects random infrared dot patterns
- Left and right IR cameras observe the speckle pattern
- Pattern shift = disparity
Advantage: Can obtain depth even on textureless surfaces (white walls, tabletops)
Disadvantage: Outdoor sunlight overwhelms infrared speckle
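A minimal capture sketch with Intel's pyrealsense2 SDK (stream settings are illustrative; the camera computes depth on-device from the IR pair):

```python
import pyrealsense2 as rs

# Start a depth stream: 640x480, 16-bit depth, 30 fps
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)

try:
    frames = pipeline.wait_for_frames()
    depth_frame = frames.get_depth_frame()
    # Depth in meters at the image center
    w, h = depth_frame.get_width(), depth_frame.get_height()
    print("center depth:", depth_frame.get_distance(w // 2, h // 2), "m")
finally:
    pipeline.stop()
```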
Coded Structured Light
Uses spatially or temporally encoded patterns:
| Encoding Method | Number of Patterns | Speed | Accuracy | Representative |
|---|---|---|---|---|
| Binary coding | Multiple frames | Slow | High | Industrial 3D scanning |
| Gray code | \(\log_2 N\) frames | Medium | High | Structured light scanners |
| Phase shifting | 3-4 frames | Medium | Highest | Precision measurement |
| Single-frame coding | 1 frame | Fast | Medium | Kinect v1 (PrimeSense) |
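To make the Gray-code row concrete, here is a sketch of generating the \(\log_2 N\) column patterns and decoding a projector column from a pixel's observed bit sequence (hypothetical helper functions, not from any scanner SDK):

```python
import numpy as np

def gray_code_patterns(width):
    """One binary pattern per bit: ceil(log2(width)) arrays of shape (width,)."""
    n_bits = int(np.ceil(np.log2(width)))
    cols = np.arange(width)
    gray = cols ^ (cols >> 1)  # binary index -> Gray code
    return [((gray >> b) & 1).astype(np.uint8) for b in range(n_bits - 1, -1, -1)]

def decode_column(bits):
    """Projector column from a pixel's observed Gray-code bits (MSB first)."""
    value = 0
    for g in bits:
        value = (value << 1) | (g ^ (value & 1))  # Gray -> binary, bit by bit
    return value

patterns = gray_code_patterns(1024)  # 10 patterns encode 1024 columns
bits = [p[437] for p in patterns]    # bits a camera pixel would observe
assert decode_column(bits) == 437    # recovers the projector column
```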
Structured Light Depth Calculation
Similar to stereo vision, but one "camera" is replaced by a projector:

\[ Z = \frac{f \cdot B}{d_{\text{pattern}}} \]

Where \(B\) is the projector-camera baseline and \(d_{\text{pattern}}\) is the pattern offset in the camera image.
Time-of-Flight (ToF)
Principle
ToF cameras emit modulated light and measure the round-trip time:

\[ Z = \frac{c \cdot \Delta t}{2} \]

Where \(c\) is the speed of light (\(3 \times 10^8\text{ m/s}\)) and \(\Delta t\) is the round-trip time.
Continuous-Wave ToF (iToF)
In practice, phase difference of modulated light is used for ranging:
Emitted signal: \(s(t) = A \cos(2\pi f_m t)\)
Received signal: \(r(t) = A' \cos(2\pi f_m t + \phi)\)
Depth:

\[ Z = \frac{c}{4\pi f_m}\,\phi \]

Maximum unambiguous distance (where the phase \(\phi\) wraps at \(2\pi\)):

\[ Z_{\max} = \frac{c}{2 f_m} \]
Maximum Measurement Distance
With modulation frequency \(f_m = 20\text{ MHz}\):

\[ Z_{\max} = \frac{3 \times 10^8}{2 \times 20 \times 10^6} = 7.5\text{ m} \]
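A small numeric sketch of the phase-depth relation and the frequency trade-off (higher \(f_m\) gives finer depth per unit of phase but a shorter unambiguous range):

```python
import math

c = 3e8  # speed of light, m/s

def itof_depth(phase_rad, f_mod):
    """Depth from the measured phase shift of a CW-modulated ToF signal."""
    return c * phase_rad / (4 * math.pi * f_mod)

for f_mod in [20e6, 60e6, 100e6]:
    z_max = c / (2 * f_mod)  # unambiguous range, where phase wraps at 2*pi
    print(f"f_m = {f_mod / 1e6:5.0f} MHz: Z_max = {z_max:5.2f} m, "
          f"Z(phi = pi) = {itof_depth(math.pi, f_mod):5.2f} m")
```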
Direct ToF (dToF)
Uses SPAD (Single-Photon Avalanche Diode) to directly measure photon flight time:
- Higher accuracy (mm-level)
- Stronger ambient light resistance
- Apple LiDAR Scanner (iPad Pro) uses dToF
- Higher cost
Comparison of Three Technologies
| Feature | Passive Stereo | Active Stereo | Structured Light | ToF |
|---|---|---|---|---|
| Outdoor Performance | Good | Medium | Poor | Medium |
| Textureless Surfaces | Poor | Good | Good | Good |
| Depth Range | 0.5-20m | 0.3-10m | 0.2-4m | 0.1-8m |
| Accuracy @1m | 3-5mm | 1-2mm | <1mm | 1-3mm |
| Resolution | High | High | High | Low (VGA) |
| Frame Rate | High (100fps) | High (90fps) | Low (30fps) | High (60fps) |
| Computation | Heavy | Medium | Light | Minimal |
| Multi-Device Interference | None | Possible | Severe | Possible |
| Power Consumption | Low | Medium | Medium | Medium |
| Cost | Medium | Medium | Medium | High |
Practical Application Selection
Scenario Matching
| Application Scenario | Recommended Technology | Reason |
|---|---|---|
| Indoor navigation | Active stereo | Textureless walls need projection |
| Outdoor navigation | Passive stereo + LiDAR | IR unreliable in sunlight |
| Robotic arm grasping | Structured light / active stereo | Close-range high accuracy |
| Human body tracking | ToF / active stereo | Fast, simple body shapes |
| 3D scanning | Structured light (multi-frame) | Highest accuracy |
| High-speed scenarios | Passive stereo / ToF | High frame rate, low latency |
Point Cloud Fusion
```python
import numpy as np
import open3d as o3d

# Camera intrinsics for back-projecting depth pixels (fx, fy, cx, cy assumed known)
intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 480, fx, fy, cx, cy)

# Multi-frame fusion into a truncated signed distance field (TSDF)
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.005,  # 5 mm voxels
    sdf_trunc=0.02,      # 2 cm truncation distance
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8
)

# frames: iterable of (depth, color, pose), where pose is camera-to-world
for depth, color, pose in frames:
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        o3d.geometry.Image(color),
        o3d.geometry.Image(depth),
        depth_trunc=3.0,                # ignore depth beyond 3 m
        convert_rgb_to_intensity=False  # keep RGB for the colored TSDF
    )
    # integrate() expects the world-to-camera extrinsic, hence the inverse
    volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))

# Extract and visualize the fused surface
mesh = volume.extract_triangle_mesh()
mesh.compute_vertex_normals()
o3d.visualization.draw_geometries([mesh])
```
Summary
- Stereo depth formula \(Z = fB/d\) is the core of stereo vision
- Depth error is proportional to the square of distance; baseline and focal length determine the accuracy ceiling
- Epipolar rectification reduces 2D search to 1D and is a prerequisite for stereo matching
- Structured light has clear advantages on textureless surfaces but is limited outdoors
- ToF is independent of texture and ambient light but has lower resolution
- Practical systems typically combine multiple depth sensing technologies
References
- Hartley, R., & Zisserman, A. *Multiple View Geometry in Computer Vision*. Cambridge University Press.
- Scharstein, D., & Szeliski, R. "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms." International Journal of Computer Vision, 2002.
- Intel RealSense White Paper: Active IR Stereo.
- Geng, J. "Structured-Light 3D Surface Imaging: A Tutorial." Advances in Optics and Photonics, 2011.