Stereo Vision and Structured Light

Overview

Stereo vision and structured light are two major depth sensing technologies. Stereo vision mimics human binocular vision to compute depth through disparity, while structured light measures depth by projecting known patterns and observing their deformation. This section provides an in-depth look at the mathematical principles, engineering implementations, and performance comparisons.

Stereo Vision Principles

Basic Geometry

A stereo camera consists of two horizontally aligned cameras that compute depth through triangulation:

  O_L *-------- B --------* O_R
      |\                 /|
    f | \               / | f
      |  \             /  |
  ----+---*-----------*---+----  Image plane
         x_L         x_R
          \           /
           \         /
            \       /
             \     /
              \   /
               \ /
                *  P(X,Y,Z)

Disparity and Depth

Disparity is the horizontal position difference of the same point in left and right images:

\[ d = x_L - x_R \]

From similar triangles:

\[ \frac{B}{Z} = \frac{d}{f} \]

Therefore depth is:

\[ \boxed{Z = \frac{fB}{d}} \]

Where:

  • \(Z\): Depth (meters)
  • \(f\): Focal length (pixels)
  • \(B\): Baseline length (meters)
  • \(d\): Disparity (pixels)
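
As a quick check of the formula, a minimal Python helper (the f = 400 px, B = 0.12 m defaults are the ZED 2i-like numbers used in the worked example below):

def depth_from_disparity(d_px, f_px=400.0, baseline_m=0.12):
    """Z = f * B / d, with f and d in pixels and B in meters."""
    return f_px * baseline_m / d_px

print(depth_from_disparity(48.0))  # 48 px of disparity -> 1.0 m
print(depth_from_disparity(4.8))   # 4.8 px of disparity -> 10.0 m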

Depth Resolution

Minimum depth resolution (depth difference corresponding to adjacent disparity values):

\[ \Delta Z = \frac{Z^2}{fB} \cdot \Delta d \]

When \(\Delta d = 1\) (integer-pixel disparity), the depth quantization step grows with the square of distance, i.e., depth resolution degrades quadratically as the target moves away.

Depth Resolution Calculation

Parameters: \(f=400\text{px}\), \(B=0.12\text{m}\) (ZED 2i)

  • At 1m: \(\Delta Z = \frac{1^2}{400 \times 0.12} = 0.021\text{m} = 2.1\text{cm}\)
  • At 5m: \(\Delta Z = \frac{25}{48} = 0.52\text{m} = 52\text{cm}\)

Sub-pixel disparity (0.1px accuracy) can improve resolution by 10x.
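
The same arithmetic as a short loop, a sketch that just evaluates \(\Delta Z = Z^2/(fB) \cdot \Delta d\) for the parameters above:

f_px, B = 400.0, 0.12        # ZED 2i-like parameters from the example above
for Z in (1.0, 2.0, 5.0, 10.0):
    dZ = Z**2 / (f_px * B)   # depth step for Δd = 1 px
    print(f"Z = {Z:4.1f} m -> ΔZ = {100 * dZ:5.1f} cm (with 0.1 px sub-pixel: {10 * dZ:.2f} cm)")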

Depth Error Analysis

\[ \sigma_Z = \frac{Z^2}{fB} \cdot \sigma_d \]

Three paths to improve depth accuracy:

  1. Increase baseline \(B\): but the near-range blind zone also increases (\(Z_{\min} \approx \frac{fB}{d_{\max}}\)); the sketch after this list quantifies the trade-off
  2. Increase focal length \(f\): but the FOV decreases
  3. Reduce disparity error \(\sigma_d\): better matching algorithms, sub-pixel interpolation
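
A short sketch of the trade-off in path 1. The matching error \(\sigma_d = 0.25\) px and \(d_{\max} = 128\) px are assumed values for illustration, not taken from the text above:

f_px, d_max, sigma_d = 400.0, 128.0, 0.25   # assumed values for illustration
for B in (0.06, 0.12, 0.24):
    sigma_Z = 5.0**2 / (f_px * B) * sigma_d  # depth error at Z = 5 m
    Z_min = f_px * B / d_max                 # near-range blind zone
    print(f"B = {B:.2f} m: σ_Z@5m ≈ {sigma_Z:.2f} m, Z_min ≈ {Z_min:.2f} m")

# Doubling the baseline halves the depth error at 5 m but also doubles
# the minimum measurable distance.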

Epipolar Geometry

Epipolar Constraint

For a calibrated stereo system, the match of a point in the left image must lie on the corresponding epipolar line in the right image:

\[ \mathbf{x}_R^T \mathbf{F} \mathbf{x}_L = 0 \]

Where \(\mathbf{F}\) is the Fundamental Matrix.

For rectified stereo, epipolar lines become horizontal, reducing the search from 2D to 1D:

Left image:                    Right image:
+------------------+           +------------------+
|     * P_L        |           |        * P_R     |
|---epipolar line--|    ->     |---epipolar line--|
|                  |           |                  |
+------------------+           +------------------+
P_L and P_R lie on the same row (same y coordinate)
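
To make the constraint concrete, the sketch below builds \(\mathbf{F}\) from known calibration via \(\mathbf{F} = \mathbf{K}_R^{-T} [\mathbf{T}]_\times \mathbf{R} \mathbf{K}_L^{-1}\), using placeholder intrinsics and a 12 cm horizontal baseline, and shows that the resulting epipolar line is simply the same image row:

import numpy as np

K = np.array([[400.0, 0, 320], [0, 400.0, 240], [0, 0, 1]])  # placeholder intrinsics
R = np.eye(3)                       # rectified-style geometry: no rotation
T = np.array([0.12, 0.0, 0.0])      # 12 cm horizontal baseline
Tx = np.array([[0, -T[2], T[1]],    # [T]_x, the skew-symmetric matrix of T
               [T[2], 0, -T[0]],
               [-T[1], T[0], 0]])
F = np.linalg.inv(K).T @ Tx @ R @ np.linalg.inv(K)

x_L = np.array([350.0, 200.0, 1.0])   # a pixel in the left image
line = F @ x_L                        # epipolar line (a, b, c): a*u + b*v + c = 0
print(line)                           # a = 0 -> horizontal line at v = 200

x_R = np.array([320.0, 200.0, 1.0])   # candidate match on the same row
print(x_R @ F @ x_L)                  # ~0: the epipolar constraint holds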

Rectification

Stereo rectification makes the two camera image planes coplanar and row-aligned:

import cv2
import numpy as np

# Calibration parameters (placeholder values; substitute your own calibration)
fx, fy, cx, cy = 700.0, 700.0, 640.0, 360.0   # intrinsics, pixels
k1, k2, p1, p2, k3 = 0.0, 0.0, 0.0, 0.0, 0.0  # distortion coefficients
image_size = (1280, 720)                       # (width, height)
K_left = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
K_right = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
D_left = np.array([k1, k2, p1, p2, k3])
D_right = np.array([k1, k2, p1, p2, k3])
R = np.eye(3)                    # right-camera rotation relative to left
T = np.array([-0.12, 0.0, 0.0])  # right-camera translation (m): 12 cm baseline

# Stereo rectification
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
    K_left, D_left, K_right, D_right,
    image_size, R, T,
    flags=cv2.CALIB_ZERO_DISPARITY,
    alpha=0
)

# Generate mapping tables
map1_left, map2_left = cv2.initUndistortRectifyMap(
    K_left, D_left, R1, P1, image_size, cv2.CV_16SC2
)
map1_right, map2_right = cv2.initUndistortRectifyMap(
    K_right, D_right, R2, P2, image_size, cv2.CV_16SC2
)

# Apply rectification to the raw captured frames left_img / right_img
left_rectified = cv2.remap(left_img, map1_left, map2_left, cv2.INTER_LINEAR)
right_rectified = cv2.remap(right_img, map1_right, map2_right, cv2.INTER_LINEAR)

Disparity Matching Algorithms

Block Matching

# OpenCV Semi-Global Block Matching
stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,     # Must be a multiple of 16
    blockSize=5,             # Matching window size
    P1=8 * 3 * 5**2,        # Disparity smoothness parameter 1
    P2=32 * 3 * 5**2,       # Disparity smoothness parameter 2
    disp12MaxDiff=1,         # Left-right consistency check
    uniquenessRatio=10,      # Uniqueness constraint
    speckleWindowSize=100,   # Speckle removal
    speckleRange=32
)

# Compute disparity map (SGBM returns 16-bit fixed point with 4 fractional bits)
disparity = stereo.compute(left_rectified, right_rectified)
disparity_float = disparity.astype(np.float32) / 16.0

# Disparity to depth (focal_length in pixels, baseline in meters, from calibration)
depth = (focal_length * baseline) / (disparity_float + 1e-6)
depth[disparity_float <= 0] = 0  # mask out invalid disparities

Matching Algorithm Comparison

Algorithm                    Accuracy     Speed                   Features
BM (Block Matching)          Low          Fastest (~100 fps)      Simple local matching
SGBM (Semi-Global)           Medium       Medium (~30 fps)        Multi-direction cost aggregation
ELAS                         Medium-High  Medium (~20 fps)        Adaptive support points
AANet (deep learning)        High         Medium (~20 fps, GPU)   End-to-end learning
RAFT-Stereo (deep learning)  Highest      Slow (~5 fps, GPU)      Iterative RAFT-style refinement

Deep Learning Stereo Matching

import torch
from raft_stereo import RAFTStereo  # princeton-vl/RAFT-Stereo reference code

# Load a pretrained model (argument handling and checkpoint keys follow that repo)
model = RAFTStereo(args)
model.load_state_dict(torch.load('raft-stereo.pth'))
model.cuda()
model.eval()

with torch.no_grad():
    # HWC uint8 images -> NCHW float tensors on the GPU
    left = torch.from_numpy(left_img).permute(2, 0, 1).float().unsqueeze(0).cuda()
    right = torch.from_numpy(right_img).permute(2, 0, 1).float().unsqueeze(0).cuda()

    # Predict disparity with 20 refinement iterations
    _, disparity = model(left, right, iters=20)
    disparity = disparity.squeeze().cpu().numpy()

Structured Light

Principle

Structured light projects known light patterns onto a scene and computes depth by observing pattern deformation:

Projector           Camera
    *                 *
    |\               /|
    | \  Known      / |
    |  \ pattern   /  |
    |   \  |||    /   |
    |    \ |||   /    |
    ====== Object ======
Pattern deformation reflects surface shape

Speckle Pattern

RealSense D435i uses infrared speckle projection:

  • Projects random infrared dot patterns
  • Left and right IR cameras observe the speckle pattern
  • Pattern shift = disparity

Advantage: Can obtain depth even on textureless surfaces (white walls, tabletops)

Disadvantage: Outdoor sunlight overwhelms infrared speckle

Coded Structured Light

Uses spatially or temporally encoded patterns:

Encoding Method       Number of Patterns   Speed    Accuracy   Representative Use
Binary coding         Multiple frames      Slow     High       Industrial 3D scanning
Gray code             log2(N) frames       Medium   High       Structured-light scanners
Phase shifting        3-4 frames           Medium   Highest    Precision measurement
Single-frame coding   1 frame              Fast     Medium     Azure Kinect
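
As an illustration of the Gray-code row, the following sketch generates the pattern stack for a projector with an assumed 1024×768 resolution (the resolution and MSB-first ordering are illustrative choices, not tied to any specific scanner):

import numpy as np

W, H = 1024, 768                          # assumed projector resolution
n_bits = int(np.ceil(np.log2(W)))         # 10 frames for W = 1024
cols = np.arange(W)
gray = cols ^ (cols >> 1)                 # binary column index -> Gray code

patterns = np.zeros((n_bits, H, W), dtype=np.uint8)
for b in range(n_bits):
    bit = (gray >> (n_bits - 1 - b)) & 1  # one bit plane per frame, MSB first
    patterns[b] = np.tile(bit * 255, (H, 1)).astype(np.uint8)

# Decoding reverses this: threshold each captured frame, reassemble the Gray
# code per pixel, convert back to binary to recover the projector column,
# then column offset -> disparity -> depth as in stereo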

Structured Light Depth Calculation

Similar to stereo vision, but one "camera" is replaced by a projector:

\[ Z = \frac{fB}{d_{\text{pattern}}} \]

Where \(d_{\text{pattern}}\) is the pattern offset in the camera image.

Time-of-Flight (ToF)

Principle

ToF cameras emit modulated light and measure the round-trip time:

\[ d = \frac{c \cdot \Delta t}{2} \]

Where \(c\) is the speed of light (\(3 \times 10^8\) m/s) and \(\Delta t\) is the round-trip time.

Continuous-Wave ToF (iToF)

In practice, iToF ranges by measuring the phase difference of the modulated light:

Emitted signal: \(s(t) = A \cos(2\pi f_m t)\)

Received signal: \(r(t) = A' \cos(2\pi f_m t + \phi)\)

Depth:

\[ d = \frac{c \cdot \phi}{4\pi f_m} \]

Maximum unambiguous distance:

\[ d_{\max} = \frac{c}{2 f_m} \]

Maximum Measurement Distance

Modulation frequency \(f_m = 20\text{MHz}\):

\[d_{\max} = \frac{3 \times 10^8}{2 \times 20 \times 10^6} = 7.5\text{m}\]
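
The phase-to-depth arithmetic, including the wrap-around beyond \(d_{\max}\), in a few lines (pure formula evaluation, no sensor API):

import numpy as np

C = 3e8       # speed of light, m/s
f_m = 20e6    # modulation frequency, Hz

def itof_depth(phi):
    """Depth from measured phase shift phi (radians): d = c*phi / (4*pi*f_m)."""
    return C * phi / (4 * np.pi * f_m)

print(itof_depth(np.pi))      # 3.75 m (half of the 7.5 m ambiguity range)
print(itof_depth(2 * np.pi))  # 7.5 m = d_max; the phase wraps beyond this,
                              # so a target at 8.0 m is reported at 0.5 m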

Direct ToF (dToF)

Uses SPAD (Single-Photon Avalanche Diode) to directly measure photon flight time:

  • Higher accuracy (mm-level)
  • Stronger ambient light resistance
  • Apple LiDAR Scanner (iPad Pro) uses dToF
  • Higher cost

Comparison of Three Technologies

Feature                    Passive Stereo   Active Stereo   Structured Light   ToF
Outdoor performance        Good             Medium          Poor               Medium
Textureless surfaces       Poor             Good            Good               Good
Depth range                0.5-20 m         0.3-10 m        0.2-4 m            0.1-8 m
Accuracy @ 1 m             3-5 mm           1-2 mm          <1 mm              1-3 mm
Resolution                 High             High            High               Low (VGA)
Frame rate                 High (100 fps)   High (90 fps)   Low (30 fps)       High (60 fps)
Computation                Heavy            Medium          Light              Minimal
Multi-device interference  None             Possible        Severe             Possible
Power consumption          Low              Medium          Medium             Medium
Cost                       Medium           Medium          Medium             High

Practical Application Selection

Scenario Matching

Application Scenario   Recommended Technology             Reason
Indoor navigation      Active stereo                      Textureless walls need projection
Outdoor navigation     Passive stereo + LiDAR             IR unreliable in sunlight
Robotic arm grasping   Structured light / active stereo   Close-range high accuracy
Human body tracking    ToF / active stereo                Fast; simple body shapes
3D scanning            Structured light (multi-frame)     Highest accuracy
High-speed scenarios   Passive stereo / ToF               High frame rate, low latency

Point Cloud Fusion

import open3d as o3d
import numpy as np

# Camera intrinsics (placeholder values; use your calibrated parameters)
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5
intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 480, fx, fy, cx, cy)

# Multi-frame fusion into a TSDF volume
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.005,  # 5 mm voxels
    sdf_trunc=0.02,      # truncation distance (m)
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8
)

# frames yields (depth, color, pose) triples; pose is camera-to-world
for depth, color, pose in frames:
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        o3d.geometry.Image(color),
        o3d.geometry.Image(depth),
        depth_trunc=3.0,                 # discard depth beyond 3 m
        convert_rgb_to_intensity=False   # keep RGB for the RGB8 volume
    )
    volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))

# Extract mesh
mesh = volume.extract_triangle_mesh()
mesh.compute_vertex_normals()
o3d.visualization.draw_geometries([mesh])

Summary

  1. Stereo depth formula \(Z = fB/d\) is the core of stereo vision
  2. Depth error is proportional to the square of distance; baseline and focal length determine the accuracy ceiling
  3. Epipolar rectification reduces 2D search to 1D and is a prerequisite for stereo matching
  4. Structured light has clear advantages on textureless surfaces but is limited outdoors
  5. ToF is independent of texture and ambient light but has lower resolution
  6. Practical systems typically combine multiple depth sensing technologies

