Overview of Robot Vision
Overview
Vision is one of the most important ways for robots to perceive the world. As the "eyes" of robots, cameras provide rich environmental information for navigation, manipulation, human-robot interaction, and other tasks. This section provides an overview of the basic components, camera types, and design considerations for robot vision systems.
The Role of Vision in Robotics
Core Applications
| Application Domain | Task | Camera Requirements |
|---|---|---|
| Navigation | SLAM, obstacle avoidance, path planning | Depth information, wide FOV, high frame rate |
| Manipulation | Object detection, grasp pose estimation | High resolution, depth accuracy |
| Human-Robot Interaction | Human detection, gesture recognition | RGB HD, low latency |
| Inspection | Defect detection, meter reading | High resolution, color accuracy |
| Autonomous Driving | Lane detection, pedestrian detection | Wide dynamic range, high frame rate |
Vision Processing Pipeline
```mermaid
graph LR
    A[Optical System<br>Lens+Sensor] --> B[Image Acquisition<br>ISP/Decoding]
    B --> C[Preprocessing<br>Undistortion/Enhancement]
    C --> D[Feature Extraction<br>Traditional/Deep Learning]
    D --> E[High-Level Understanding<br>Detection/Segmentation/Estimation]
    E --> F[Decision/Control<br>Planning/Action]
    style A fill:#fff3e0
    style B fill:#ffe0b2
    style C fill:#ffcc80
    style D fill:#ffb74d
    style E fill:#ffa726
    style F fill:#ff9800
```
Camera Type Overview
Classification by Information Acquired
| Type | Output | Advantage | Disadvantage | Representative Product |
|---|---|---|---|---|
| RGB Camera | 2D color image | Low cost, high resolution | No depth information | IMX219, C920 |
| Depth Camera | RGB + depth map | Direct 3D acquisition | Limited outdoors, limited range | RealSense D435i |
| Stereo Camera | Left/right images -> depth | Passive ranging, outdoor capable | High computation | ZED 2i |
| Structured Light | Projected patterns -> depth | High precision | Indoor only, short range | Kinect Azure |
| ToF Camera | Time-of-flight -> depth | Fast, texture-independent | Low resolution | PMD Flexx2 |
| Event Camera | Asynchronous brightness changes | Ultra-high temporal resolution | Immature ecosystem | DAVIS346 |
| Panoramic Camera | 360-degree image | Full coverage | Large distortion | Ricoh Theta |
Classification by Interface
| Interface | Bandwidth | Latency | Cable Length | CPU Usage | Suitable Scenario |
|---|---|---|---|---|---|
| CSI-2 | 2.5 Gbps/lane (4 lanes) | Lowest | <30cm | Very low (ISP hardware) | Embedded systems |
| USB 2.0 | 480 Mbps | Medium | <5m | Medium | Low-resolution cameras |
| USB 3.0 | 5 Gbps | Medium | <3m | Medium | Mainstream cameras |
| GigE Vision | 1 Gbps | Medium | <100m | Low | Industrial cameras |
| 10GigE | 10 Gbps | Low | <100m | Low | High-speed industrial cameras |
Resolution and Frame Rate
Resolution Selection
| Resolution | Pixel Count | Data Size (RGB) | Typical FPS | Suitable Scenario |
|---|---|---|---|---|
| VGA (640x480) | 0.3MP | 0.9MB/frame | 60-120fps | Real-time navigation, obstacle avoidance |
| 720p (1280x720) | 0.9MP | 2.7MB/frame | 30-60fps | General vision tasks |
| 1080p (1920x1080) | 2.1MP | 6.2MB/frame | 30fps | Object detection, recognition |
| 4K (3840x2160) | 8.3MP | 24.9MB/frame | 15-30fps | High-precision detection |
Frame Rate Requirements
To keep motion blur from smearing features, the frame rate must limit an object's apparent motion to roughly one pixel per frame:

\[
\text{fps}_{\text{min}} = \frac{v_{\text{max}}}{d_{\text{min}} \cdot \text{FOV}_{\text{pixel}}}
\]

Where:
- \(v_{\text{max}}\): Maximum robot speed
- \(d_{\text{min}}\): Minimum object detection distance
- \(\text{FOV}_{\text{pixel}}\): Field of view per pixel (the angle subtended by one pixel)
Frame Rate Calculation Example
Mobile robot speed 1 m/s, detecting obstacles at 2 m, horizontal FOV 60 degrees, resolution 640 px:

The object's angular speed is \(v_{\text{max}}/d_{\text{min}} = 0.5\,\text{rad/s} \approx 28.6°/\text{s}\), and each pixel covers \(60°/640 \approx 0.094°\), so the object sweeps about \(28.6/0.094 \approx 305\) pixels per second. Keeping motion under one pixel per frame would therefore require roughly 300 fps. In practice, since detection algorithms tolerate several pixels of motion blur, 30 fps is usually sufficient (about 10 pixels of motion per frame in this example).
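This frame-rate requirement is easy to evaluate directly. Below is a small helper, with an explicit blur budget in pixels; the one-pixel and ten-pixel budgets are illustrative assumptions:

```python
import math

def min_fps(v_max, d_min, fov_deg, res_px, blur_px=1.0):
    """Minimum frame rate so an object at d_min moves at most blur_px pixels per frame."""
    fov_per_pixel = math.radians(fov_deg) / res_px  # rad subtended by one pixel
    ang_speed = v_max / d_min                       # rad/s (small-angle approximation)
    return ang_speed / (fov_per_pixel * blur_px)

# 1 m/s robot, obstacle at 2 m, 60-degree HFOV over 640 px:
print(round(min_fps(1.0, 2.0, 60.0, 640)))               # roughly 306 fps for sub-pixel blur
print(round(min_fps(1.0, 2.0, 60.0, 640, blur_px=10.0)))  # roughly 31 fps if 10 px blur is acceptable
```

Relaxing the blur budget by 10x drops the requirement by 10x, which is why 30 fps is usually workable.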
Bandwidth Calculation
| Configuration | Raw Bandwidth | Compressed (MJPEG) |
|---|---|---|
| 640x480 RGB @30fps | 27.6 MB/s | ~5 MB/s |
| 1080p RGB @30fps | 186 MB/s | ~30 MB/s |
| 1080p RGB @60fps | 373 MB/s | ~60 MB/s |
| 4K RGB @30fps | 746 MB/s | ~100 MB/s |
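The raw figures in the table follow directly from width × height × 3 bytes × fps. A one-line helper (using 1 MB = 10^6 bytes, as the table does) makes them easy to check:

```python
def raw_bandwidth_mb_s(width, height, fps, bytes_per_pixel=3):
    """Uncompressed video bandwidth in MB/s (3 bytes/pixel for 8-bit RGB)."""
    return width * height * bytes_per_pixel * fps / 1e6

print(raw_bandwidth_mb_s(640, 480, 30))    # 27.648 MB/s
print(raw_bandwidth_mb_s(1920, 1080, 30))  # 186.624 MB/s
print(raw_bandwidth_mb_s(3840, 2160, 30))  # 746.496 MB/s
```

Comparing these numbers against the interface table above explains, for example, why 4K@30 RGB cannot cross USB 2.0 uncompressed.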
Key Optical Parameters
Focal Length and Field of View
Focal length \(f\) and sensor size determine the field of view (FOV):

\[
\text{FOV} = 2 \arctan\left(\frac{d}{2f}\right)
\]

Where \(d\) is the sensor dimension in that direction.
| Focal Length | FOV (1/2.3" sensor) | Features | Application |
|---|---|---|---|
| 2.1mm | ~120 degrees | Ultra-wide angle | Panoramic, obstacle avoidance |
| 3.6mm | ~80 degrees | Wide angle | Navigation, SLAM |
| 6mm | ~50 degrees | Standard | Object detection |
| 12mm | ~25 degrees | Narrow angle | Long-range recognition |
| 25mm | ~12 degrees | Telephoto | Inspection, meter reading |
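The table entries can be reproduced with the pinhole relation above. In the sketch below, the 6.17 mm sensor width for a 1/2.3" sensor is an assumed nominal value, and very short fisheye lenses deviate from this rectilinear model:

```python
import math

def fov_deg(focal_mm, sensor_mm):
    """Rectilinear field of view: FOV = 2 * atan(d / (2 f))."""
    return math.degrees(2 * math.atan(sensor_mm / (2 * focal_mm)))

# Assumed 1/2.3" sensor width of 6.17 mm:
print(round(fov_deg(3.6, 6.17)))  # about 81 degrees (table: ~80)
print(round(fov_deg(6.0, 6.17)))  # about 54 degrees (table: ~50)
```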
Aperture and Depth of Field
Aperture (F-number) affects light intake and depth of field. To a first approximation,

\[
\text{DoF} \propto \frac{F \cdot d^2}{f^2}
\]

Where \(F\) is the F-number, \(d\) is the focus distance, and \(f\) is the focal length.
- Small F-number (large aperture): More light, shallow depth of field -- suitable for dark environments
- Large F-number (small aperture): Less light, deep depth of field -- suitable for all-in-focus scenarios
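Putting numbers on this proportionality requires a circle-of-confusion constant; the 5 µm value below is an assumed, sensor-dependent figure, and the approximation holds only well below the hyperfocal distance:

```python
def depth_of_field_m(f_number, focus_dist_m, focal_mm, coc_mm=0.005):
    """Approximate DoF = 2 * F * c * d^2 / f^2 (c = assumed circle of confusion)."""
    f_m = focal_mm / 1000.0
    c_m = coc_mm / 1000.0
    return 2 * f_number * c_m * focus_dist_m ** 2 / f_m ** 2

# 6 mm lens focused at 1 m:
print(depth_of_field_m(2.0, 1.0, 6.0))  # f/2: ~0.56 m of sharp range
print(depth_of_field_m(8.0, 1.0, 6.0))  # f/8: ~2.2 m (smaller aperture, deeper DoF)
```

The 4x jump from f/2 to f/8 illustrates the trade-off stated above: stopping down buys focus depth at the cost of light.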
Shutter Types
| Type | Principle | Advantages | Disadvantages | Suitable For |
|---|---|---|---|---|
| Rolling Shutter | Row-by-row exposure | Low cost, high resolution | Motion distortion (jello effect) | Static/slow scenes |
| Global Shutter | All pixels exposed simultaneously | No motion distortion | Higher cost, slightly more noise | High-speed motion, VIO |
Rolling Shutter Impact on SLAM
In visual SLAM, rolling shutter causes feature point position shifts, affecting pose estimation accuracy. Robots moving at high speed should prioritize global shutter cameras.
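The magnitude of the effect is easy to estimate: a row read out \(\Delta t\) later sees a scene rotated by \(\omega \cdot \Delta t\). A rough sketch, where the 30 ms readout time and 1 rad/s yaw rate are illustrative assumptions:

```python
import math

def rs_shift_px(readout_s, row, total_rows, ang_speed_rad_s, fov_per_px_rad):
    """Pixel shift of a feature on `row`, relative to row 0, for a rotating camera."""
    row_delay = readout_s * row / total_rows  # later rows sample the scene later
    return ang_speed_rad_s * row_delay / fov_per_px_rad

# 30 ms readout, 480-row sensor, camera yawing at 1 rad/s, 60-degree HFOV over 640 px:
fov_px = math.radians(60) / 640
print(rs_shift_px(0.030, 479, 480, 1.0, fov_px))  # ~18 px of skew at the last row
```

An 18-pixel skew across one frame is far beyond the sub-pixel accuracy feature-based SLAM assumes, which is why global shutter is preferred for fast motion.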
Detailed Vision Processing Pipeline
Hardware Pipeline
```mermaid
graph TB
    subgraph "Sensor Side"
        A[CMOS Sensor] --> B[Analog Signal]
        B --> C[ADC]
        C --> D[Raw Data<br>RAW Bayer]
    end
    subgraph "ISP Processing"
        D --> E[Black Level Correction]
        E --> F[Dead Pixel Correction]
        F --> G[Demosaicing]
        G --> H[White Balance]
        H --> I[Denoising]
        I --> J[Color Correction<br>CCM]
        J --> K[Gamma Correction]
        K --> L[Sharpening]
    end
    subgraph "Output"
        L --> M[RGB/YUV]
        M --> N[Encoding<br>MJPEG/H.264]
        N --> O[Transfer<br>CSI/USB]
    end
```
Software Processing Pipeline
```python
import cv2
import numpy as np

class VisionPipeline:
    def __init__(self, camera_matrix, dist_coeffs):
        self.K = camera_matrix
        self.D = dist_coeffs
        # Pre-compute undistortion maps once (size must match the incoming frames)
        self.map1, self.map2 = cv2.initUndistortRectifyMap(
            self.K, self.D, None, self.K, (640, 480), cv2.CV_16SC2
        )

    def process(self, frame):
        # 1. Undistortion
        undistorted = cv2.remap(frame, self.map1, self.map2, cv2.INTER_LINEAR)
        # 2. Color space conversion
        gray = cv2.cvtColor(undistorted, cv2.COLOR_BGR2GRAY)
        # 3. Histogram equalization (handle lighting changes)
        enhanced = cv2.equalizeHist(gray)
        # 4. Feature extraction or feed to a neural network
        return enhanced
```
Indoor vs. Outdoor Environments
| Challenge | Indoor | Outdoor |
|---|---|---|
| Lighting | Relatively stable | Extremely variable (shadows, direct sunlight) |
| Dynamic Range | ~60dB sufficient | Need >100dB (HDR) |
| Distance | 1-10m | 1m-100m+ |
| Weather | Unaffected | Rain, fog, dust |
| Depth Camera | IR projection usable | IR overwhelmed by sunlight |
| Recommended Sensor | RGB-D (RealSense) | Stereo/LiDAR |
Typical Robot Vision System Configurations
Indoor Mobile Robot
| Component | Model | Use |
|---|---|---|
| Front depth camera | RealSense D435i | Obstacle avoidance + SLAM |
| Arm eye-in-hand camera | RealSense D405 | Grasp localization |
| Rear fisheye camera | OV9281 | Rear perception |
| Computing platform | Jetson Orin Nano | Inference + ROS2 |
Autonomous Driving Cart
| Component | Model | Use |
|---|---|---|
| Front stereo camera | ZED 2i | Depth + visual SLAM |
| Surround fisheye cameras x4 | IMX219 (wide angle) | 360-degree perception |
| LiDAR | Livox Mid-360 | 3D mapping |
| Computing platform | Jetson Orin NX | Multi-model inference |
Selection Decision Flow
```mermaid
graph TD
    A[Vision Task Requirements] --> B{Need depth information?}
    B -->|No| C{Resolution requirement?}
    B -->|Yes| D{Operating environment?}
    C -->|>4K| E[Industrial Camera<br>GigE Vision]
    C -->|1080p| F[USB Camera<br>IMX477/C920]
    C -->|720p or below| G[CSI Camera<br>IMX219]
    D -->|Indoor| H{Depth range?}
    D -->|Outdoor| I[Stereo Camera<br>ZED 2i]
    H -->|<1m| J[Structured Light<br>D405]
    H -->|1-10m| K[Active Stereo<br>D435i/D455]
    H -->|>10m| L[LiDAR+RGB]
```
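The decision flow can also be captured as a small helper function. The thresholds and model names below simply mirror the chart; treat them as starting points rather than hard rules:

```python
def suggest_camera(need_depth, outdoor=False, depth_range_m=None, resolution=None):
    """Mirror the selection flowchart; returns a hardware suggestion string."""
    if not need_depth:
        if resolution is not None and resolution > 2160:
            return "Industrial camera (GigE Vision)"
        if resolution is not None and resolution >= 1080:
            return "USB camera (IMX477/C920)"
        return "CSI camera (IMX219)"
    if outdoor:
        return "Stereo camera (ZED 2i)"
    if depth_range_m is None or depth_range_m > 10:
        return "LiDAR + RGB"
    if depth_range_m < 1:
        return "Structured light (D405)"
    return "Active stereo (D435i/D455)"

print(suggest_camera(need_depth=True, outdoor=False, depth_range_m=5))
# Active stereo (D435i/D455)
```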
Summary
- Cameras are among the most important sensors for robots; selection must match the specific task
- CSI interface has the lowest latency for embedded use; USB is flexible; GigE suits long distances
- Resolution and frame rate must be balanced under bandwidth constraints
- Global shutter is critical for high-speed motion and VIO
- Use RGB-D indoors, use stereo or LiDAR-assisted outdoors
- The vision pipeline spans the complete chain from optics to software