Stereo Vision and Structured Light

Overview

Stereo vision and structured light are two major depth sensing technologies. Stereo vision mimics human binocular vision to compute depth through disparity, while structured light measures depth by projecting known patterns and observing their deformation. This section provides an in-depth look at the mathematical principles, engineering implementations, and performance comparisons.

Stereo Vision Principles

Basic Geometry

A stereo camera consists of two horizontally aligned cameras that compute depth through triangulation:

  O_L *-------- B --------* O_R
      |\                 /|
    f | \               / | f
      |  \             /  |
  ----+---*-----------*---+----  Image plane
         x_L         x_R
          \           /
           \         /
            \       /
             \     /
              \   /
               \ /
                *  P(X,Y,Z)

Disparity and Depth

Disparity is the horizontal position difference of the same point in left and right images:

\[ d = x_L - x_R \]

From similar triangles:

\[ \frac{B}{Z} = \frac{d}{f} \]

Therefore depth is:

\[ \boxed{Z = \frac{fB}{d}} \]

Where:

  • \(Z\): Depth (meters)
  • \(f\): Focal length (pixels)
  • \(B\): Baseline length (meters)
  • \(d\): Disparity (pixels)
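
As a quick check of the formula, a minimal Python helper (the f = 400 px, B = 0.12 m defaults are the ZED 2i-like numbers used in the worked example below):

def depth_from_disparity(d_px, f_px=400.0, baseline_m=0.12):
    """Z = f * B / d, with f and d in pixels and B in meters."""
    return f_px * baseline_m / d_px

print(depth_from_disparity(48.0))  # 48 px of disparity -> 1.0 m
print(depth_from_disparity(4.8))   # 4.8 px of disparity -> 10.0 m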

Depth Resolution

Minimum depth resolution (depth difference corresponding to adjacent disparity values):

\[ \Delta Z = \frac{Z^2}{fB} \cdot \Delta d \]

When \(\Delta d = 1\) (integer-pixel disparity), the depth quantization step grows with the square of distance, i.e., depth resolution degrades quadratically as the target moves away.

Depth Resolution Calculation

Parameters: \(f=400\text{px}\), \(B=0.12\text{m}\) (ZED 2i)

  • At 1m: \(\Delta Z = \frac{1^2}{400 \times 0.12} = 0.021\text{m} = 2.1\text{cm}\)
  • At 5m: \(\Delta Z = \frac{25}{48} = 0.52\text{m} = 52\text{cm}\)

Sub-pixel disparity (0.1px accuracy) can improve resolution by 10x.
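
The same arithmetic as a short loop, a sketch that just evaluates \(\Delta Z = Z^2/(fB) \cdot \Delta d\) for the parameters above:

f_px, B = 400.0, 0.12        # ZED 2i-like parameters from the example above
for Z in (1.0, 2.0, 5.0, 10.0):
    dZ = Z**2 / (f_px * B)   # depth step for Δd = 1 px
    print(f"Z = {Z:4.1f} m -> ΔZ = {100 * dZ:5.1f} cm (with 0.1 px sub-pixel: {10 * dZ:.2f} cm)")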

Depth Error Analysis

\[ \sigma_Z = \frac{Z^2}{fB} \cdot \sigma_d \]

Three paths to improve depth accuracy:

  1. Increase baseline \(B\): but the near-range blind zone also increases (\(Z_{\min} \approx \frac{fB}{d_{\max}}\)); the sketch after this list quantifies the trade-off
  2. Increase focal length \(f\): but the FOV decreases
  3. Reduce disparity error \(\sigma_d\): better matching algorithms, sub-pixel interpolation
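
A short sketch of the trade-off in path 1. The matching error \(\sigma_d = 0.25\) px and \(d_{\max} = 128\) px are assumed values for illustration, not taken from the text above:

f_px, d_max, sigma_d = 400.0, 128.0, 0.25   # assumed values for illustration
for B in (0.06, 0.12, 0.24):
    sigma_Z = 5.0**2 / (f_px * B) * sigma_d  # depth error at Z = 5 m
    Z_min = f_px * B / d_max                 # near-range blind zone
    print(f"B = {B:.2f} m: σ_Z@5m ≈ {sigma_Z:.2f} m, Z_min ≈ {Z_min:.2f} m")

# Doubling the baseline halves the depth error at 5 m but also doubles
# the minimum measurable distance.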

Epipolar Geometry

Epipolar Constraint

For a calibrated stereo system, the match of a point in the left image must lie on the corresponding epipolar line in the right image:

\[ \mathbf{x}_R^T \mathbf{F} \mathbf{x}_L = 0 \]

Where \(\mathbf{F}\) is the Fundamental Matrix.

For rectified stereo, epipolar lines become horizontal, reducing the search from 2D to 1D:

Left image:                    Right image:
+------------------+           +------------------+
|     * P_L        |           |        * P_R     |
|---epipolar line--|    ->     |---epipolar line--|
|                  |           |                  |
+------------------+           +------------------+
P_L and P_R lie on the same row (same y coordinate)
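
To make the constraint concrete, the sketch below builds \(\mathbf{F}\) from known calibration via \(\mathbf{F} = \mathbf{K}_R^{-T} [\mathbf{T}]_\times \mathbf{R} \mathbf{K}_L^{-1}\), using placeholder intrinsics and a 12 cm horizontal baseline, and shows that the resulting epipolar line is simply the same image row:

import numpy as np

K = np.array([[400.0, 0, 320], [0, 400.0, 240], [0, 0, 1]])  # placeholder intrinsics
R = np.eye(3)                       # rectified-style geometry: no rotation
T = np.array([0.12, 0.0, 0.0])      # 12 cm horizontal baseline
Tx = np.array([[0, -T[2], T[1]],    # [T]_x, the skew-symmetric matrix of T
               [T[2], 0, -T[0]],
               [-T[1], T[0], 0]])
F = np.linalg.inv(K).T @ Tx @ R @ np.linalg.inv(K)

x_L = np.array([350.0, 200.0, 1.0])   # a pixel in the left image
line = F @ x_L                        # epipolar line (a, b, c): a*u + b*v + c = 0
print(line)                           # a = 0 -> horizontal line at v = 200

x_R = np.array([320.0, 200.0, 1.0])   # candidate match on the same row
print(x_R @ F @ x_L)                  # ~0: the epipolar constraint holds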

Rectification

Stereo rectification makes the two camera image planes coplanar and row-aligned:

import cv2
import numpy as np

# Calibration parameters (placeholder values; substitute your own calibration)
fx, fy, cx, cy = 700.0, 700.0, 640.0, 360.0   # intrinsics, pixels
k1, k2, p1, p2, k3 = 0.0, 0.0, 0.0, 0.0, 0.0  # distortion coefficients
image_size = (1280, 720)                       # (width, height)
K_left = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
K_right = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
D_left = np.array([k1, k2, p1, p2, k3])
D_right = np.array([k1, k2, p1, p2, k3])
R = np.eye(3)                    # right-camera rotation relative to left
T = np.array([-0.12, 0.0, 0.0])  # right-camera translation (m): 12 cm baseline

# Stereo rectification
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
    K_left, D_left, K_right, D_right,
    image_size, R, T,
    flags=cv2.CALIB_ZERO_DISPARITY,
    alpha=0
)

# Generate mapping tables
map1_left, map2_left = cv2.initUndistortRectifyMap(
    K_left, D_left, R1, P1, image_size, cv2.CV_16SC2
)
map1_right, map2_right = cv2.initUndistortRectifyMap(
    K_right, D_right, R2, P2, image_size, cv2.CV_16SC2
)

# Apply rectification to the raw captured frames left_img / right_img
left_rectified = cv2.remap(left_img, map1_left, map2_left, cv2.INTER_LINEAR)
right_rectified = cv2.remap(right_img, map1_right, map2_right, cv2.INTER_LINEAR)

Disparity Matching Algorithms

Block Matching

# OpenCV Semi-Global Block Matching
stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,     # Must be a multiple of 16
    blockSize=5,             # Matching window size
    P1=8 * 3 * 5**2,        # Disparity smoothness parameter 1
    P2=32 * 3 * 5**2,       # Disparity smoothness parameter 2
    disp12MaxDiff=1,         # Left-right consistency check
    uniquenessRatio=10,      # Uniqueness constraint
    speckleWindowSize=100,   # Speckle removal
    speckleRange=32
)

# Compute disparity map (SGBM returns 16-bit fixed point with 4 fractional bits)
disparity = stereo.compute(left_rectified, right_rectified)
disparity_float = disparity.astype(np.float32) / 16.0

# Disparity to depth (focal_length in pixels, baseline in meters, from calibration)
depth = (focal_length * baseline) / (disparity_float + 1e-6)
depth[disparity_float <= 0] = 0  # mask out invalid disparities

Matching Algorithm Comparison

Algorithm                    Accuracy     Speed                   Features
BM (Block Matching)          Low          Fastest (~100 fps)      Simple local matching
SGBM (Semi-Global)           Medium       Medium (~30 fps)        Multi-direction cost aggregation
ELAS                         Medium-High  Medium (~20 fps)        Adaptive support points
AANet (deep learning)        High         Medium (~20 fps, GPU)   End-to-end learning
RAFT-Stereo (deep learning)  Highest      Slow (~5 fps, GPU)      Iterative RAFT-style refinement

Deep Learning Stereo Matching

import torch
from raft_stereo import RAFTStereo  # princeton-vl/RAFT-Stereo reference code

# Load a pretrained model (argument handling and checkpoint keys follow that repo)
model = RAFTStereo(args)
model.load_state_dict(torch.load('raft-stereo.pth'))
model.cuda()
model.eval()

with torch.no_grad():
    # HWC uint8 images -> NCHW float tensors on the GPU
    left = torch.from_numpy(left_img).permute(2, 0, 1).float().unsqueeze(0).cuda()
    right = torch.from_numpy(right_img).permute(2, 0, 1).float().unsqueeze(0).cuda()

    # Predict disparity with 20 refinement iterations
    _, disparity = model(left, right, iters=20)
    disparity = disparity.squeeze().cpu().numpy()

Structured Light

Principle

Structured light projects known light patterns onto a scene and computes depth by observing pattern deformation:

Projector           Camera
    *                 *
    |\               /|
    | \  Known      / |
    |  \ pattern   /  |
    |   \  |||    /   |
    |    \ |||   /    |
    ====== Object ======
Pattern deformation reflects surface shape

Speckle Pattern

RealSense D435i uses infrared speckle projection:

  • Projects random infrared dot patterns
  • Left and right IR cameras observe the speckle pattern
  • Pattern shift = disparity

Advantage: Can obtain depth even on textureless surfaces (white walls, tabletops)

Disadvantage: Outdoor sunlight overwhelms infrared speckle

Coded Structured Light

Uses spatially or temporally encoded patterns:

Encoding Method       Number of Patterns   Speed    Accuracy   Representative Use
Binary coding         Multiple frames      Slow     High       Industrial 3D scanning
Gray code             log2(N) frames       Medium   High       Structured-light scanners
Phase shifting        3-4 frames           Medium   Highest    Precision measurement
Single-frame coding   1 frame              Fast     Medium     Azure Kinect
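
As an illustration of the Gray-code row, the following sketch generates the pattern stack for a projector with an assumed 1024×768 resolution (the resolution and MSB-first ordering are illustrative choices, not tied to any specific scanner):

import numpy as np

W, H = 1024, 768                          # assumed projector resolution
n_bits = int(np.ceil(np.log2(W)))         # 10 frames for W = 1024
cols = np.arange(W)
gray = cols ^ (cols >> 1)                 # binary column index -> Gray code

patterns = np.zeros((n_bits, H, W), dtype=np.uint8)
for b in range(n_bits):
    bit = (gray >> (n_bits - 1 - b)) & 1  # one bit plane per frame, MSB first
    patterns[b] = np.tile(bit * 255, (H, 1)).astype(np.uint8)

# Decoding reverses this: threshold each captured frame, reassemble the Gray
# code per pixel, convert back to binary to recover the projector column,
# then column offset -> disparity -> depth as in stereo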

Structured Light Depth Calculation

Similar to stereo vision, but one "camera" is replaced by a projector:

\[ Z = \frac{fB}{d_{\text{pattern}}} \]

Where \(d_{\text{pattern}}\) is the pattern offset in the camera image.

Time-of-Flight (ToF)

Principle

ToF cameras emit modulated light and measure the round-trip time:

\[ d = \frac{c \cdot \Delta t}{2} \]

Where \(c\) is the speed of light (\(3 \times 10^8\) m/s) and \(\Delta t\) is the round-trip time.

Continuous-Wave ToF (iToF)

In practice, iToF ranges by measuring the phase difference of the modulated light:

Emitted signal: \(s(t) = A \cos(2\pi f_m t)\)

Received signal: \(r(t) = A' \cos(2\pi f_m t + \phi)\)

Depth:

\[ d = \frac{c \cdot \phi}{4\pi f_m} \]

Maximum unambiguous distance:

\[ d_{\max} = \frac{c}{2 f_m} \]

Maximum Measurement Distance

Modulation frequency \(f_m = 20\text{MHz}\):

\[d_{\max} = \frac{3 \times 10^8}{2 \times 20 \times 10^6} = 7.5\text{m}\]
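
The phase-to-depth arithmetic, including the wrap-around beyond \(d_{\max}\), in a few lines (pure formula evaluation, no sensor API):

import numpy as np

C = 3e8       # speed of light, m/s
f_m = 20e6    # modulation frequency, Hz

def itof_depth(phi):
    """Depth from measured phase shift phi (radians): d = c*phi / (4*pi*f_m)."""
    return C * phi / (4 * np.pi * f_m)

print(itof_depth(np.pi))      # 3.75 m (half of the 7.5 m ambiguity range)
print(itof_depth(2 * np.pi))  # 7.5 m = d_max; the phase wraps beyond this,
                              # so a target at 8.0 m is reported at 0.5 m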

Direct ToF (dToF)

Uses SPAD (Single-Photon Avalanche Diode) to directly measure photon flight time:

  • Higher accuracy (mm-level)
  • Stronger ambient light resistance
  • Apple LiDAR Scanner (iPad Pro) uses dToF
  • Higher cost

Comparison of Three Technologies

Feature                    Passive Stereo   Active Stereo   Structured Light   ToF
Outdoor performance        Good             Medium          Poor               Medium
Textureless surfaces       Poor             Good            Good               Good
Depth range                0.5-20 m         0.3-10 m        0.2-4 m            0.1-8 m
Accuracy @ 1 m             3-5 mm           1-2 mm          <1 mm              1-3 mm
Resolution                 High             High            High               Low (VGA)
Frame rate                 High (100 fps)   High (90 fps)   Low (30 fps)       High (60 fps)
Computation                Heavy            Medium          Light              Minimal
Multi-device interference  None             Possible        Severe             Possible
Power consumption          Low              Medium          Medium             Medium
Cost                       Medium           Medium          Medium             High

Practical Application Selection

Scenario Matching

Application Scenario   Recommended Technology             Reason
Indoor navigation      Active stereo                      Textureless walls need projection
Outdoor navigation     Passive stereo + LiDAR             IR unreliable in sunlight
Robotic arm grasping   Structured light / active stereo   Close-range high accuracy
Human body tracking    ToF / active stereo                Fast; simple body shapes
3D scanning            Structured light (multi-frame)     Highest accuracy
High-speed scenarios   Passive stereo / ToF               High frame rate, low latency

Point Cloud Fusion

import open3d as o3d
import numpy as np

# Camera intrinsics (placeholder values; use your calibrated parameters)
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5
intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 480, fx, fy, cx, cy)

# Multi-frame fusion into a TSDF volume
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.005,  # 5 mm voxels
    sdf_trunc=0.02,      # truncation distance (m)
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8
)

# frames yields (depth, color, pose) triples; pose is camera-to-world
for depth, color, pose in frames:
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        o3d.geometry.Image(color),
        o3d.geometry.Image(depth),
        depth_trunc=3.0,                 # discard depth beyond 3 m
        convert_rgb_to_intensity=False   # keep RGB for the RGB8 volume
    )
    volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))

# Extract mesh
mesh = volume.extract_triangle_mesh()
mesh.compute_vertex_normals()
o3d.visualization.draw_geometries([mesh])

Summary

  1. Stereo depth formula \(Z = fB/d\) is the core of stereo vision
  2. Depth error is proportional to the square of distance; baseline and focal length determine the accuracy ceiling
  3. Epipolar rectification reduces 2D search to 1D and is a prerequisite for stereo matching
  4. Structured light has clear advantages on textureless surfaces but is limited outdoors
  5. ToF is independent of texture and ambient light but has lower resolution
  6. Practical systems typically combine multiple depth sensing technologies

