Overview of Robot Vision

Overview

Vision is one of the most important ways for robots to perceive the world. As the "eyes" of robots, cameras provide rich environmental information for navigation, manipulation, human-robot interaction, and other tasks. This section provides an overview of the basic components, camera types, and design considerations for robot vision systems.

The Role of Vision in Robotics

Core Applications

| Application Domain | Task | Camera Requirements |
| --- | --- | --- |
| Navigation | SLAM, obstacle avoidance, path planning | Depth information, wide FOV, high frame rate |
| Manipulation | Object detection, grasp pose estimation | High resolution, depth accuracy |
| Human-Robot Interaction | Human detection, gesture recognition | HD RGB, low latency |
| Inspection | Defect detection, meter reading | High resolution, color accuracy |
| Autonomous Driving | Lane detection, pedestrian detection | Wide dynamic range, high frame rate |

Vision Processing Pipeline

```mermaid
graph LR
    A[Optical System<br>Lens+Sensor] --> B[Image Acquisition<br>ISP/Decoding]
    B --> C[Preprocessing<br>Undistortion/Enhancement]
    C --> D[Feature Extraction<br>Traditional/Deep Learning]
    D --> E[High-Level Understanding<br>Detection/Segmentation/Estimation]
    E --> F[Decision/Control<br>Planning/Action]

    style A fill:#fff3e0
    style B fill:#ffe0b2
    style C fill:#ffcc80
    style D fill:#ffb74d
    style E fill:#ffa726
    style F fill:#ff9800
```

Camera Type Overview

Classification by Information Acquired

| Type | Output | Advantage | Disadvantage | Representative Product |
| --- | --- | --- | --- | --- |
| RGB Camera | 2D color image | Low cost, high resolution | No depth information | IMX219, C920 |
| Depth Camera | RGB + depth map | Direct 3D acquisition | Limited outdoors, limited range | RealSense D435i |
| Stereo Camera | Left/right images -> depth | Passive ranging, outdoor capable | High computation | ZED 2i |
| Structured Light | Projected patterns -> depth | High precision | Indoor only, short range | Kinect Azure |
| ToF Camera | Time of flight -> depth | Fast, texture-independent | Low resolution | PMD Flexx2 |
| Event Camera | Asynchronous brightness changes | Ultra-high temporal resolution | Immature ecosystem | DAVIS346 |
| Panoramic Camera | 360-degree image | Full coverage | Large distortion | Ricoh Theta |

Classification by Interface

| Interface | Bandwidth | Latency | Cable Length | CPU Usage | Suitable Scenario |
| --- | --- | --- | --- | --- | --- |
| CSI-2 | 2.5 Gbps/lane (4 lanes) | Lowest | <30cm | Very low (hardware ISP) | Embedded systems |
| USB 2.0 | 480 Mbps | Medium | <5m | Medium | Low-resolution cameras |
| USB 3.0 | 5 Gbps | Medium | <3m | Medium | Mainstream cameras |
| GigE Vision | 1 Gbps | Medium | <100m | Low | Industrial cameras |
| 10GigE | 10 Gbps | Low | <100m | Low | High-speed industrial cameras |

Resolution and Frame Rate

Resolution Selection

| Resolution | Pixel Count | Data Size (RGB) | Typical FPS | Suitable Scenario |
| --- | --- | --- | --- | --- |
| VGA (640x480) | 0.3MP | 0.9MB/frame | 60-120fps | Real-time navigation, obstacle avoidance |
| 720p (1280x720) | 0.9MP | 2.7MB/frame | 30-60fps | General vision tasks |
| 1080p (1920x1080) | 2.1MP | 6.2MB/frame | 30fps | Object detection, recognition |
| 4K (3840x2160) | 8.3MP | 24.9MB/frame | 15-30fps | High-precision detection |

Frame Rate Requirements

\[ \text{Minimum FPS} = \frac{v_{\text{max}}}{d_{\text{min}}} \times \frac{1}{\text{FOV}_{\text{pixel}}} \]

Where:

  • \(v_{\text{max}}\): Maximum robot speed (m/s)
  • \(d_{\text{min}}\): Minimum object detection distance (m); the ratio \(v_{\text{max}}/d_{\text{min}}\) approximates the worst-case angular velocity (rad/s) of an object crossing the field of view
  • \(\text{FOV}_{\text{pixel}}\): Field of view per pixel (rad); the result is the frame rate needed to keep apparent motion under one pixel per frame

Frame Rate Calculation Example

Mobile robot speed 1m/s, need to detect obstacles at 2m, horizontal FOV 60 degrees, resolution 640px:

\[\text{Per-pixel FOV} = \frac{60°}{640} \approx 0.094° \approx 1.64 \times 10^{-3}\,\text{rad}\]
\[\text{Minimum FPS} = \frac{1}{2} \times \frac{1}{1.64 \times 10^{-3}} \approx 305 \text{ fps}\]

In practice, since detection algorithms tolerate several pixels of motion blur, 30fps is usually sufficient.
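The calculation above can be checked with a few lines of Python (a minimal sketch of the one-pixel-per-frame criterion; function name and defaults are illustrative):

```python
import math

def min_fps(v_max, d_min, fov_deg, width_px):
    """Frame rate needed to keep apparent motion under one pixel per frame.

    v_max: robot speed (m/s); d_min: detection distance (m);
    fov_deg: horizontal field of view (degrees); width_px: image width.
    """
    fov_per_pixel = math.radians(fov_deg / width_px)  # rad per pixel
    angular_velocity = v_max / d_min                  # worst-case rad/s
    return angular_velocity / fov_per_pixel           # pixels/s == required fps

print(f"{min_fps(1.0, 2.0, 60, 640):.0f} fps")  # roughly 305 fps
```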

Bandwidth Calculation

\[ \text{Bandwidth (bytes/s)} = \text{Width} \times \text{Height} \times \text{Channels} \times \frac{\text{Bit Depth}}{8} \times \text{FPS} \]

| Configuration | Raw Bandwidth | Compressed (MJPEG) |
| --- | --- | --- |
| 640x480 RGB @30fps | 27.6 MB/s | ~5 MB/s |
| 1080p RGB @30fps | 186 MB/s | ~30 MB/s |
| 1080p RGB @60fps | 373 MB/s | ~60 MB/s |
| 4K RGB @30fps | 746 MB/s | ~100 MB/s |
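The raw-bandwidth column follows directly from the formula; a small helper reproduces it (8-bit, 3-channel assumptions as in the table):

```python
def raw_bandwidth_mbs(width, height, channels=3, bit_depth=8, fps=30):
    """Uncompressed video bandwidth in MB/s (1 MB = 10^6 bytes)."""
    bytes_per_frame = width * height * channels * bit_depth // 8
    return bytes_per_frame * fps / 1e6

for w, h, fps in [(640, 480, 30), (1920, 1080, 30), (1920, 1080, 60), (3840, 2160, 30)]:
    print(f"{w}x{h} @ {fps}fps: {raw_bandwidth_mbs(w, h, fps=fps):.1f} MB/s")
```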

Key Optical Parameters

Focal Length and Field of View

Focal length \(f\) and sensor size determine the field of view (FOV):

\[ \text{FOV} = 2 \arctan\left(\frac{d}{2f}\right) \]

Where \(d\) is the sensor dimension in that direction.

| Focal Length | FOV (1/2.3" sensor) | Features | Application |
| --- | --- | --- | --- |
| 2.1mm | ~120 degrees | Ultra-wide angle | Panoramic, obstacle avoidance |
| 3.6mm | ~80 degrees | Wide angle | Navigation, SLAM |
| 6mm | ~50 degrees | Standard | Object detection |
| 12mm | ~25 degrees | Narrow angle | Long-range recognition |
| 25mm | ~12 degrees | Telephoto | Inspection, meter reading |
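The FOV formula can be evaluated directly. The sketch below assumes a 1/2.3" sensor is about 6.17 mm wide (a common nominal figure; exact dimensions vary by model, and real wide-angle lenses deviate from this pinhole approximation due to distortion):

```python
import math

def fov_deg(sensor_dim_mm, focal_mm):
    """Field of view (degrees) from sensor dimension and focal length."""
    return math.degrees(2 * math.atan(sensor_dim_mm / (2 * focal_mm)))

# 1/2.3" sensor, ~6.17 mm horizontal dimension (assumed nominal value)
print(f"{fov_deg(6.17, 3.6):.0f} degrees")  # ~81 degrees, matching the ~80 in the table
```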

Aperture and Depth of Field

Aperture (F-number) affects light intake and depth of field:

\[ \text{Depth of Field} \propto \frac{F \times d^2}{f^2} \]

Where \(F\) is the F-number, \(d\) is the focus distance, and \(f\) is the focal length.

  • Small F-number (large aperture): More light, shallow depth of field -- suitable for dark environments
  • Large F-number (small aperture): Less light, deep depth of field -- suitable for all-in-focus scenarios

Shutter Types

| Type | Principle | Advantages | Disadvantages | Suitable For |
| --- | --- | --- | --- | --- |
| Rolling Shutter | Row-by-row exposure | Low cost, high resolution | Motion distortion (jello effect) | Static/slow scenes |
| Global Shutter | All pixels exposed simultaneously | No motion distortion | Higher cost, slightly more noise | High-speed motion, VIO |

Rolling Shutter Impact on SLAM

In visual SLAM, rolling shutter causes feature point position shifts, affecting pose estimation accuracy. Robots moving at high speed should prioritize global shutter cameras.
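A rough feel for the magnitude of the effect: the pixel skew between the first and last sensor row during a rotation is the angle swept during readout divided by the per-pixel FOV. The numbers below are illustrative assumptions, not from any datasheet:

```python
def rolling_shutter_skew_px(omega_deg_s, readout_ms, fov_deg, width_px):
    """Horizontal pixel skew between first and last row during a yaw rotation.

    omega_deg_s: yaw rate (deg/s); readout_ms: full-frame readout time (ms);
    fov_deg / width_px: horizontal FOV and image width.
    """
    deg_per_pixel = fov_deg / width_px
    swept_deg = omega_deg_s * (readout_ms / 1000)  # angle swept during readout
    return swept_deg / deg_per_pixel

# Assumed: robot yawing at 90 deg/s, 30 ms readout, 90-degree FOV, 640 px wide
print(f"{rolling_shutter_skew_px(90, 30, 90, 640):.1f} px")  # 19.2 px of skew
```

A skew of tens of pixels easily breaks feature matching, which is why VIO systems favor global shutter.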

Detailed Vision Processing Pipeline

Hardware Pipeline

```mermaid
graph TB
    subgraph SensorSide["Sensor Side"]
        A[CMOS Sensor] --> B[Analog Signal]
        B --> C[ADC]
        C --> D[Raw Data<br>RAW Bayer]
    end

    subgraph ISP["ISP Processing"]
        D --> E[Black Level Correction]
        E --> F[Dead Pixel Correction]
        F --> G[Demosaicing]
        G --> H[White Balance]
        H --> I[Denoising]
        I --> J[Color Correction<br>CCM]
        J --> K[Gamma Correction]
        K --> L[Sharpening]
    end

    subgraph Output
        L --> M[RGB/YUV]
        M --> N[Encoding<br>MJPEG/H.264]
        N --> O[Transfer<br>CSI/USB]
    end
```

Software Processing Pipeline

```python
import cv2
import numpy as np

class VisionPipeline:
    def __init__(self, camera_matrix, dist_coeffs, image_size=(640, 480)):
        self.K = camera_matrix
        self.D = dist_coeffs
        # Pre-compute undistortion maps once; cv2.remap() is then cheap per frame
        self.map1, self.map2 = cv2.initUndistortRectifyMap(
            self.K, self.D, None, self.K, image_size, cv2.CV_16SC2
        )

    def process(self, frame):
        # 1. Undistortion
        undistorted = cv2.remap(frame, self.map1, self.map2,
                                cv2.INTER_LINEAR)
        # 2. Color space conversion
        gray = cv2.cvtColor(undistorted, cv2.COLOR_BGR2GRAY)
        # 3. Histogram equalization (handle lighting changes)
        enhanced = cv2.equalizeHist(gray)
        # 4. Feature extraction or feed to a neural network
        return enhanced
```

Indoor vs. Outdoor Environments

| Challenge | Indoor | Outdoor |
| --- | --- | --- |
| Lighting | Relatively stable | Extremely variable (shadows, direct sunlight) |
| Dynamic Range | ~60dB sufficient | Need >100dB (HDR) |
| Distance | 1-10m | 1m-100m+ |
| Weather | Unaffected | Rain, fog, dust |
| Depth Camera | IR projection usable | IR overwhelmed by sunlight |
| Recommended Sensor | RGB-D (RealSense) | Stereo/LiDAR |

Typical Robot Vision System Configurations

Indoor Mobile Robot

| Component | Model | Use |
| --- | --- | --- |
| Front depth camera | RealSense D435i | Obstacle avoidance + SLAM |
| Arm eye-in-hand camera | RealSense D405 | Grasp localization |
| Rear fisheye camera | OV9281 | Rear perception |
| Computing platform | Jetson Orin Nano | Inference + ROS2 |

Autonomous Driving Cart

| Component | Model | Use |
| --- | --- | --- |
| Front stereo camera | ZED 2i | Depth + visual SLAM |
| Surround fisheye cameras x4 | IMX219 (wide angle) | 360-degree perception |
| LiDAR | Livox Mid-360 | 3D mapping |
| Computing platform | Jetson Orin NX | Multi-model inference |

Selection Decision Flow

```mermaid
graph TD
    A[Vision Task Requirements] --> B{Need depth information?}
    B -->|No| C{Resolution requirement?}
    B -->|Yes| D{Operating environment?}

    C -->|>4K| E[Industrial Camera<br>GigE Vision]
    C -->|1080p| F[USB Camera<br>IMX477/C920]
    C -->|720p or below| G[CSI Camera<br>IMX219]

    D -->|Indoor| H{Depth range?}
    D -->|Outdoor| I[Stereo Camera<br>ZED 2i]

    H -->|<1m| J[Structured Light<br>D405]
    H -->|1-10m| K[Active Stereo<br>D435i/D455]
    H -->|>10m| L[LiDAR+RGB]
```

Summary

  1. Cameras are among the most important sensors for robots; selection must match the specific task
  2. CSI interface has the lowest latency for embedded use; USB is flexible; GigE suits long distances
  3. Resolution and frame rate must be balanced under bandwidth constraints
  4. Global shutter is critical for high-speed motion and VIO
  5. Use RGB-D cameras indoors; use stereo cameras or LiDAR-assisted vision outdoors
  6. The vision pipeline spans the complete chain from optics to software
