Overview of Robot Vision
Overview
Vision is one of the most important ways for robots to perceive the world. As the "eyes" of robots, cameras provide rich environmental information for navigation, manipulation, human-robot interaction, and other tasks. This section provides an overview of the basic components, camera types, and design considerations for robot vision systems.
The Role of Vision in Robotics
Core Applications
| Application Domain | Task | Camera Requirements |
|---|---|---|
| Navigation | SLAM, obstacle avoidance, path planning | Depth information, wide FOV, high frame rate |
| Manipulation | Object detection, grasp pose estimation | High resolution, depth accuracy |
| Human-Robot Interaction | Human detection, gesture recognition | RGB HD, low latency |
| Inspection | Defect detection, meter reading | High resolution, color accuracy |
| Autonomous Driving | Lane detection, pedestrian detection | Wide dynamic range, high frame rate |
Vision Processing Pipeline
```mermaid
graph LR
    A[Optical System<br>Lens+Sensor] --> B[Image Acquisition<br>ISP/Decoding]
    B --> C[Preprocessing<br>Undistortion/Enhancement]
    C --> D[Feature Extraction<br>Traditional/Deep Learning]
    D --> E[High-Level Understanding<br>Detection/Segmentation/Estimation]
    E --> F[Decision/Control<br>Planning/Action]
    style A fill:#fff3e0
    style B fill:#ffe0b2
    style C fill:#ffcc80
    style D fill:#ffb74d
    style E fill:#ffa726
    style F fill:#ff9800
```
Camera Type Overview
Classification by Information Acquired
| Type | Output | Advantage | Disadvantage | Representative Product |
|---|---|---|---|---|
| RGB Camera | 2D color image | Low cost, high resolution | No depth information | IMX219, C920 |
| Depth Camera | RGB + depth map | Direct 3D acquisition | Limited outdoors, limited range | RealSense D435i |
| Stereo Camera | Left/right images -> depth | Passive ranging, outdoor capable | High computation | ZED 2i |
| Structured Light | Projected patterns -> depth | High precision | Indoor only, short range | Kinect Azure |
| ToF Camera | Time-of-flight -> depth | Fast, texture-independent | Low resolution | PMD Flexx2 |
| Event Camera | Asynchronous brightness changes | Ultra-high temporal resolution | Immature ecosystem | DAVIS346 |
| Panoramic Camera | 360-degree image | Full coverage | Large distortion | Ricoh Theta |
Classification by Interface
| Interface | Bandwidth | Latency | Cable Length | CPU Usage | Suitable Scenario |
|---|---|---|---|---|---|
| CSI-2 | 2.5 Gbps/lane (4 lanes) | Lowest | <30cm | Very low (ISP hardware) | Embedded systems |
| USB 2.0 | 480 Mbps | Medium | <5m | Medium | Low-resolution cameras |
| USB 3.0 | 5 Gbps | Medium | <3m | Medium | Mainstream cameras |
| GigE Vision | 1 Gbps | Medium | <100m | Low | Industrial cameras |
| 10GigE | 10 Gbps | Low | <100m | Low | High-speed industrial cameras |
Resolution and Frame Rate
Resolution Selection
| Resolution | Pixel Count | Data Size (RGB) | Typical FPS | Suitable Scenario |
|---|---|---|---|---|
| VGA (640x480) | 0.3MP | 0.9MB/frame | 60-120fps | Real-time navigation, obstacle avoidance |
| 720p (1280x720) | 0.9MP | 2.7MB/frame | 30-60fps | General vision tasks |
| 1080p (1920x1080) | 2.1MP | 6.2MB/frame | 30fps | Object detection, recognition |
| 4K (3840x2160) | 8.3MP | 24.9MB/frame | 15-30fps | High-precision detection |
Frame Rate Requirements
To keep motion blur from smearing features, the frame rate must limit an object's apparent motion to roughly one pixel per frame:

\[
\text{fps}_{\text{min}} = \frac{v_{\text{max}}}{d_{\text{min}} \cdot \text{FOV}_{\text{pixel}}}
\]

Where:
- \(v_{\text{max}}\): Maximum robot speed
- \(d_{\text{min}}\): Minimum object detection distance
- \(\text{FOV}_{\text{pixel}}\): Field of view per pixel (the angle subtended by one pixel)
Frame Rate Calculation Example
Mobile robot speed 1 m/s, detecting obstacles at 2 m, horizontal FOV 60 degrees, resolution 640 px:

The object's angular speed is \(v_{\text{max}}/d_{\text{min}} = 0.5\,\text{rad/s} \approx 28.6°/\text{s}\), and each pixel covers \(60°/640 \approx 0.094°\), so the object sweeps about \(28.6/0.094 \approx 305\) pixels per second. Keeping motion under one pixel per frame would therefore require roughly 300 fps. In practice, since detection algorithms tolerate several pixels of motion blur, 30 fps is usually sufficient (about 10 pixels of motion per frame in this example).
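This frame-rate requirement is easy to evaluate directly. Below is a small helper, with an explicit blur budget in pixels; the one-pixel and ten-pixel budgets are illustrative assumptions:

```python
import math

def min_fps(v_max, d_min, fov_deg, res_px, blur_px=1.0):
    """Minimum frame rate so an object at d_min moves at most blur_px pixels per frame."""
    fov_per_pixel = math.radians(fov_deg) / res_px  # rad subtended by one pixel
    ang_speed = v_max / d_min                       # rad/s (small-angle approximation)
    return ang_speed / (fov_per_pixel * blur_px)

# 1 m/s robot, obstacle at 2 m, 60-degree HFOV over 640 px:
print(round(min_fps(1.0, 2.0, 60.0, 640)))               # roughly 306 fps for sub-pixel blur
print(round(min_fps(1.0, 2.0, 60.0, 640, blur_px=10.0)))  # roughly 31 fps if 10 px blur is acceptable
```

Relaxing the blur budget by 10x drops the requirement by 10x, which is why 30 fps is usually workable.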
Bandwidth Calculation
| Configuration | Raw Bandwidth | Compressed (MJPEG) |
|---|---|---|
| 640x480 RGB @30fps | 27.6 MB/s | ~5 MB/s |
| 1080p RGB @30fps | 186 MB/s | ~30 MB/s |
| 1080p RGB @60fps | 373 MB/s | ~60 MB/s |
| 4K RGB @30fps | 746 MB/s | ~100 MB/s |
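The raw figures in the table follow directly from width × height × 3 bytes × fps. A one-line helper (using 1 MB = 10^6 bytes, as the table does) makes them easy to check:

```python
def raw_bandwidth_mb_s(width, height, fps, bytes_per_pixel=3):
    """Uncompressed video bandwidth in MB/s (3 bytes/pixel for 8-bit RGB)."""
    return width * height * bytes_per_pixel * fps / 1e6

print(raw_bandwidth_mb_s(640, 480, 30))    # 27.648 MB/s
print(raw_bandwidth_mb_s(1920, 1080, 30))  # 186.624 MB/s
print(raw_bandwidth_mb_s(3840, 2160, 30))  # 746.496 MB/s
```

Comparing these numbers against the interface table above explains, for example, why 4K@30 RGB cannot cross USB 2.0 uncompressed.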
Key Optical Parameters
Focal Length and Field of View
Focal length \(f\) and sensor size determine the field of view (FOV):

\[
\text{FOV} = 2 \arctan\left(\frac{d}{2f}\right)
\]

Where \(d\) is the sensor dimension in that direction.
| Focal Length | FOV (1/2.3" sensor) | Features | Application |
|---|---|---|---|
| 2.1mm | ~120 degrees | Ultra-wide angle | Panoramic, obstacle avoidance |
| 3.6mm | ~80 degrees | Wide angle | Navigation, SLAM |
| 6mm | ~50 degrees | Standard | Object detection |
| 12mm | ~25 degrees | Narrow angle | Long-range recognition |
| 25mm | ~12 degrees | Telephoto | Inspection, meter reading |
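The table entries can be reproduced with the pinhole relation above. In the sketch below, the 6.17 mm sensor width for a 1/2.3" sensor is an assumed nominal value, and very short fisheye lenses deviate from this rectilinear model:

```python
import math

def fov_deg(focal_mm, sensor_mm):
    """Rectilinear field of view: FOV = 2 * atan(d / (2 f))."""
    return math.degrees(2 * math.atan(sensor_mm / (2 * focal_mm)))

# Assumed 1/2.3" sensor width of 6.17 mm:
print(round(fov_deg(3.6, 6.17)))  # about 81 degrees (table: ~80)
print(round(fov_deg(6.0, 6.17)))  # about 54 degrees (table: ~50)
```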
Aperture and Depth of Field
Aperture (F-number) affects light intake and depth of field. To a first approximation,

\[
\text{DoF} \propto \frac{F \cdot d^2}{f^2}
\]

Where \(F\) is the F-number, \(d\) is the focus distance, and \(f\) is the focal length.
- Small F-number (large aperture): More light, shallow depth of field -- suitable for dark environments
- Large F-number (small aperture): Less light, deep depth of field -- suitable for all-in-focus scenarios
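Putting numbers on this proportionality requires a circle-of-confusion constant; the 5 µm value below is an assumed, sensor-dependent figure, and the approximation holds only well below the hyperfocal distance:

```python
def depth_of_field_m(f_number, focus_dist_m, focal_mm, coc_mm=0.005):
    """Approximate DoF = 2 * F * c * d^2 / f^2 (c = assumed circle of confusion)."""
    f_m = focal_mm / 1000.0
    c_m = coc_mm / 1000.0
    return 2 * f_number * c_m * focus_dist_m ** 2 / f_m ** 2

# 6 mm lens focused at 1 m:
print(depth_of_field_m(2.0, 1.0, 6.0))  # f/2: ~0.56 m of sharp range
print(depth_of_field_m(8.0, 1.0, 6.0))  # f/8: ~2.2 m (smaller aperture, deeper DoF)
```

The 4x jump from f/2 to f/8 illustrates the trade-off stated above: stopping down buys focus depth at the cost of light.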
Shutter Types
| Type | Principle | Advantages | Disadvantages | Suitable For |
|---|---|---|---|---|
| Rolling Shutter | Row-by-row exposure | Low cost, high resolution | Motion distortion (jello effect) | Static/slow scenes |
| Global Shutter | All pixels exposed simultaneously | No motion distortion | Higher cost, slightly more noise | High-speed motion, VIO |
Rolling Shutter Impact on SLAM
In visual SLAM, rolling shutter causes feature point position shifts, affecting pose estimation accuracy. Robots moving at high speed should prioritize global shutter cameras.
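The magnitude of the effect is easy to estimate: a row read out \(\Delta t\) later sees a scene rotated by \(\omega \cdot \Delta t\). A rough sketch, where the 30 ms readout time and 1 rad/s yaw rate are illustrative assumptions:

```python
import math

def rs_shift_px(readout_s, row, total_rows, ang_speed_rad_s, fov_per_px_rad):
    """Pixel shift of a feature on `row`, relative to row 0, for a rotating camera."""
    row_delay = readout_s * row / total_rows  # later rows sample the scene later
    return ang_speed_rad_s * row_delay / fov_per_px_rad

# 30 ms readout, 480-row sensor, camera yawing at 1 rad/s, 60-degree HFOV over 640 px:
fov_px = math.radians(60) / 640
print(rs_shift_px(0.030, 479, 480, 1.0, fov_px))  # ~18 px of skew at the last row
```

An 18-pixel skew across one frame is far beyond the sub-pixel accuracy feature-based SLAM assumes, which is why global shutter is preferred for fast motion.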
Detailed Vision Processing Pipeline
Hardware Pipeline
```mermaid
graph TB
    subgraph "Sensor Side"
        A[CMOS Sensor] --> B[Analog Signal]
        B --> C[ADC]
        C --> D[Raw Data<br>RAW Bayer]
    end
    subgraph "ISP Processing"
        D --> E[Black Level Correction]
        E --> F[Dead Pixel Correction]
        F --> G[Demosaicing]
        G --> H[White Balance]
        H --> I[Denoising]
        I --> J[Color Correction<br>CCM]
        J --> K[Gamma Correction]
        K --> L[Sharpening]
    end
    subgraph "Output"
        L --> M[RGB/YUV]
        M --> N[Encoding<br>MJPEG/H.264]
        N --> O[Transfer<br>CSI/USB]
    end
```
Software Processing Pipeline
```python
import cv2
import numpy as np

class VisionPipeline:
    def __init__(self, camera_matrix, dist_coeffs):
        self.K = camera_matrix
        self.D = dist_coeffs
        # Pre-compute undistortion maps once (size must match the incoming frames)
        self.map1, self.map2 = cv2.initUndistortRectifyMap(
            self.K, self.D, None, self.K, (640, 480), cv2.CV_16SC2
        )

    def process(self, frame):
        # 1. Undistortion
        undistorted = cv2.remap(frame, self.map1, self.map2, cv2.INTER_LINEAR)
        # 2. Color space conversion
        gray = cv2.cvtColor(undistorted, cv2.COLOR_BGR2GRAY)
        # 3. Histogram equalization (handle lighting changes)
        enhanced = cv2.equalizeHist(gray)
        # 4. Feature extraction or feed to a neural network
        return enhanced
```
Indoor vs. Outdoor Environments
| Challenge | Indoor | Outdoor |
|---|---|---|
| Lighting | Relatively stable | Extremely variable (shadows, direct sunlight) |
| Dynamic Range | ~60dB sufficient | Need >100dB (HDR) |
| Distance | 1-10m | 1m-100m+ |
| Weather | Unaffected | Rain, fog, dust |
| Depth Camera | IR projection usable | IR overwhelmed by sunlight |
| Recommended Sensor | RGB-D (RealSense) | Stereo/LiDAR |
Typical Robot Vision System Configurations
Indoor Mobile Robot
| Component | Model | Use |
|---|---|---|
| Front depth camera | RealSense D435i | Obstacle avoidance + SLAM |
| Arm eye-in-hand camera | RealSense D405 | Grasp localization |
| Rear fisheye camera | OV9281 | Rear perception |
| Computing platform | Jetson Orin Nano | Inference + ROS2 |
Autonomous Driving Cart
| Component | Model | Use |
|---|---|---|
| Front stereo camera | ZED 2i | Depth + visual SLAM |
| Surround fisheye cameras x4 | IMX219 (wide angle) | 360-degree perception |
| LiDAR | Livox Mid-360 | 3D mapping |
| Computing platform | Jetson Orin NX | Multi-model inference |
Selection Decision Flow
```mermaid
graph TD
    A[Vision Task Requirements] --> B{Need depth information?}
    B -->|No| C{Resolution requirement?}
    B -->|Yes| D{Operating environment?}
    C -->|>4K| E[Industrial Camera<br>GigE Vision]
    C -->|1080p| F[USB Camera<br>IMX477/C920]
    C -->|720p or below| G[CSI Camera<br>IMX219]
    D -->|Indoor| H{Depth range?}
    D -->|Outdoor| I[Stereo Camera<br>ZED 2i]
    H -->|<1m| J[Structured Light<br>D405]
    H -->|1-10m| K[Active Stereo<br>D435i/D455]
    H -->|>10m| L[LiDAR+RGB]
```
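The decision flow can also be captured as a small helper function. The thresholds and model names below simply mirror the chart; treat them as starting points rather than hard rules:

```python
def suggest_camera(need_depth, outdoor=False, depth_range_m=None, resolution=None):
    """Mirror the selection flowchart; returns a hardware suggestion string."""
    if not need_depth:
        if resolution is not None and resolution > 2160:
            return "Industrial camera (GigE Vision)"
        if resolution is not None and resolution >= 1080:
            return "USB camera (IMX477/C920)"
        return "CSI camera (IMX219)"
    if outdoor:
        return "Stereo camera (ZED 2i)"
    if depth_range_m is None or depth_range_m > 10:
        return "LiDAR + RGB"
    if depth_range_m < 1:
        return "Structured light (D405)"
    return "Active stereo (D435i/D455)"

print(suggest_camera(need_depth=True, outdoor=False, depth_range_m=5))
# Active stereo (D435i/D455)
```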
Summary
- Cameras are among the most important sensors for robots; selection must match the specific task
- CSI interface has the lowest latency for embedded use; USB is flexible; GigE suits long distances
- Resolution and frame rate must be balanced under bandwidth constraints
- Global shutter is critical for high-speed motion and VIO
- Use RGB-D indoors, use stereo or LiDAR-assisted outdoors
- The vision pipeline spans the complete chain from optics to software