
GPU and Parallel Computing

Overview

The GPU (Graphics Processing Unit) was originally designed for graphics rendering, but its massively parallel architecture has made it the ideal accelerator for deep learning inference and sensor data processing. In robot systems, the GPU handles computationally intensive tasks such as visual perception, point cloud processing, and neural network inference.

GPU Architecture Fundamentals

CPU vs. GPU: Design Philosophy

| Feature | CPU | GPU |
|---------|-----|-----|
| Core Count | 4-16 (large cores) | Hundreds to thousands (small cores) |
| Clock Frequency | 2-5 GHz | 1-2 GHz |
| Cache | Large (tens of MB of L3) | Small (a few MB of L2) |
| Control Logic | Complex (branch prediction, out-of-order execution) | Simple |
| Suitable Tasks | Complex logic, low latency | Massively parallel, high throughput |
| Design Goal | Minimize latency | Maximize throughput |
CPU: Few large cores, latency-optimized
+----------------+  +----------------+
|   Large Core 1 |  |   Large Core 2 |
| [Control] [ALU]|  | [Control] [ALU]|
| [L1 Cache]     |  | [L1 Cache]     |
+----------------+  +----------------+
        +--------------------+
        |   Large L3 Cache   |
        +--------------------+

GPU: Many small cores, throughput-optimized
+----++----++----++----++----++----++----++----+
|core||core||core||core||core||core||core||core| ...
+----++----++----++----++----++----++----++----+

NVIDIA GPU Architecture

Core organizational structure of NVIDIA GPUs:

Streaming Multiprocessor (SM)

The SM is the fundamental compute unit of a GPU. Each SM contains:

  • CUDA cores (FP32): Execute floating-point operations
  • Tensor cores: Dedicated matrix operation accelerators
  • Shared Memory: Low-latency storage shared among threads within an SM
  • Register file: Large register files to support thread contexts
  • Warp schedulers: Manage warp execution
graph TB
    subgraph GPU
        subgraph SM_0[SM 0]
            A1[CUDA Cores x128]
            A2[Tensor Cores x4]
            A3[Shared Memory 128KB]
            A4[Warp Schedulers x4]
        end
        subgraph SM_1[SM 1]
            B1[CUDA Cores x128]
            B2[Tensor Cores x4]
            B3[Shared Memory 128KB]
            B4[Warp Schedulers x4]
        end
        SM_N[... SM N]
        L2[L2 Cache]
        MEM[Global Memory / VRAM]
    end

    SM_0 --> L2
    SM_1 --> L2
    SM_N --> L2
    L2 --> MEM

SIMT Execution Model

GPUs employ the SIMT (Single Instruction, Multiple Threads) execution model:

  • Threads execute in groups of 32 called warps
  • All threads in a warp execute the same instruction at any given time
  • Each thread, however, can operate on different data (different addresses, different registers)
\[ \text{Total GPU Threads} = \text{Number of SMs} \times \text{Max Threads per SM} \]
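
For a sense of scale, consider a hypothetical GPU with 16 SMs, each able to hold 1,536 resident threads (the exact per-SM limit is architecture-specific):

\[ 16 \times 1536 = 24{,}576 \ \text{resident threads} \]

Keeping tens of thousands of threads in flight is what lets the GPU hide memory latency and sustain high throughput.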

Warp Divergence

When threads within a warp encounter a conditional branch and take different paths, the hardware must execute both paths serially, masking off the threads that are not on the active path. This severely degrades efficiency:

if (threadIdx.x % 2 == 0) {
    // Even threads execute here
} else {
    // Odd threads execute here
}
// Worst case: performance halved
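
A common mitigation is to arrange the work so that the branch condition is uniform within each warp. The variant below is an illustrative sketch (warp size of 32 assumed): every thread in a warp computes the same warpId, so each warp follows a single path and nothing is serialized.

// All 32 threads of a warp share the same warpId, so the condition
// is uniform within the warp and neither path is serialized
int warpId = threadIdx.x / 32;
if (warpId % 2 == 0) {
    // Warps 0, 2, 4, ... execute here
} else {
    // Warps 1, 3, 5, ... execute here
}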

CUDA Programming Fundamentals

Thread Hierarchy

CUDA uses a hierarchical thread organization:

| Level | Description | Hardware Mapping |
|-------|-------------|------------------|
| Thread | Smallest execution unit | CUDA core |
| Warp | 32 threads | Warp scheduling unit within an SM |
| Block | Multiple warps | Mapped to one SM |
| Grid | Multiple blocks | Entire GPU |

Memory Hierarchy

| Memory Type | Size | Latency | Scope |
|-------------|------|---------|-------|
| Registers | ~256 KB/SM | 1 cycle | Per thread |
| Shared Memory | 64-228 KB/SM | ~5 cycles | Shared within a block |
| L1 Cache | Shares capacity with shared memory | ~30 cycles | Per SM |
| L2 Cache | Several MB | ~200 cycles | Entire GPU |
| Global Memory (VRAM) | 4-80 GB | ~400 cycles | Entire GPU |
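
This hierarchy is why staging data in shared memory pays off: a value read once from global memory (~400 cycles) can be reused many times at shared-memory latency (~5 cycles). The block-level sum reduction below is a minimal sketch of the pattern; the kernel name blockSum is illustrative and a power-of-two block size of 256 is assumed.

// Block-level sum reduction staged through shared memory
__global__ void blockSum(const float* in, float* out, int N) {
    __shared__ float tile[256];                      // one element per thread
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    tile[tid] = (idx < N) ? in[idx] : 0.0f;          // one slow global read per thread
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // all further reads hit shared memory
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];         // one partial sum per block
}

Launched as blockSum<<<(N + 255) / 256, 256>>>(d_in, d_out, N), it produces one partial sum per block, which can then be reduced again or copied back to the host.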

CUDA Kernel Example

// Vector addition - the most basic CUDA kernel
__global__ void vectorAdd(float* A, float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

// Launch kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
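
The launch above assumes the device buffers d_A, d_B, and d_C already exist. A minimal host-side sketch of the full round trip (allocate, copy in, launch, copy back, free) might look like this; h_A, h_B, and h_C stand in for host arrays not shown here.

// Host-side sketch: allocate, copy, launch, copy back, free
int N = 1 << 20;
size_t bytes = N * sizeof(float);
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, bytes);
cudaMalloc(&d_B, bytes);
cudaMalloc(&d_C, bytes);
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);   // h_A, h_B: host input arrays
cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize();                               // wait for the kernel to finish

cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);   // h_C: host output array
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);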

Image Processing Example

// Image grayscale conversion - robot vision preprocessing
__global__ void rgb2gray(unsigned char* rgb, unsigned char* gray, 
                         int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        int idx = y * width + x;
        int rgb_idx = idx * 3;
        // ITU-R BT.601 standard
        gray[idx] = (unsigned char)(0.299f * rgb[rgb_idx] + 
                                     0.587f * rgb[rgb_idx+1] + 
                                     0.114f * rgb[rgb_idx+2]);
    }
}

// 2D grid launch
dim3 block(16, 16);
dim3 grid((width+15)/16, (height+15)/16);
rgb2gray<<<grid, block>>>(d_rgb, d_gray, width, height);

Parallel Speedup Theory

Theoretical Speedup

In the ideal case, the speedup achieved with \(N\) processing units is:

\[ S_{\text{ideal}} = N \]

Actual speedup is limited by:

\[ S_{\text{actual}} = \frac{T_{\text{serial}}}{T_{\text{parallel}}} = \frac{T_{\text{serial}}}{T_{\text{serial\_part}} + \frac{T_{\text{parallel\_part}}}{N} + T_{\text{overhead}}} \]

Where \(T_{\text{overhead}}\) includes:

  • Thread creation and synchronization overhead
  • Memory copy overhead (CPU <-> GPU)
  • Load imbalance
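
As an illustration with made-up numbers: normalize \(T_{\text{serial}} = 1\), assume 5% of the work is inherently serial, and ignore overhead. Even with \(N = 1000\) processing units,

\[ S_{\text{actual}} \approx \frac{1}{0.05 + \frac{0.95}{1000}} \approx 19.6 \]

so the serial fraction, not the number of cores, caps the achievable speedup. This is the essence of Amdahl's Law, which motivates the contrasting view below.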

Gustafson's Law

Unlike Amdahl's Law (fixed problem size), Gustafson's Law considers scaling the problem size as the number of processors increases:

\[ S_G = N - \alpha(N-1) \]

Where \(\alpha\) is the fraction of the serial portion. This explains why GPUs are so effective for large-scale data processing.
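
With the same illustrative serial fraction \(\alpha = 0.05\) and \(N = 1000\), the scaled speedup is

\[ S_G = 1000 - 0.05 \times 999 \approx 950 \]

compared with roughly 20 under the fixed-size (Amdahl) assumption: when the workload grows with image resolution, point cloud density, and batch size, the parallel fraction dominates.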

TensorRT Inference Acceleration

TensorRT Optimization Pipeline

TensorRT is NVIDIA's deep learning inference optimizer:

graph LR
    A[PyTorch/ONNX Model] --> B[TensorRT Parsing]
    B --> C[Layer Fusion]
    C --> D[Precision Calibration<br>FP16/INT8]
    D --> E[Kernel Auto-Tuning]
    E --> F[Optimized Engine<br>TRT Engine]
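
As a concrete sketch, the pipeline above roughly corresponds to the following use of the TensorRT C++ API (TensorRT 8.x assumed; model.onnx is a placeholder and error handling is omitted):

#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <fstream>
#include <iostream>

// Minimal logger required by the TensorRT API
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
} gLogger;

int main() {
    auto builder = nvinfer1::createInferBuilder(gLogger);
    auto network = builder->createNetworkV2(
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
    auto parser = nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile("model.onnx",
        static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));     // parse the ONNX model

    auto config = builder->createBuilderConfig();
    config->setFlag(nvinfer1::BuilderFlag::kFP16);                    // precision calibration: FP16
    auto engine = builder->buildSerializedNetwork(*network, *config); // layer fusion + kernel auto-tuning

    std::ofstream out("model.engine", std::ios::binary);              // save the optimized TRT engine
    out.write(static_cast<const char*>(engine->data()), engine->size());
    return 0;
}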

Key Optimization Techniques

| Technique | Description | Typical Speedup |
|-----------|-------------|-----------------|
| Layer Fusion | Merge Conv+BN+ReLU into one kernel | 1.5-2x |
| FP16 Inference | Half-precision floating point | 2-3x |
| INT8 Quantization | 8-bit integer inference | 3-5x |
| Dynamic Batching | Automatically select optimal batch size | 1.2-1.5x |

Inference Performance Examples

Typical inference performance on Jetson Orin NX:

| Model | FP32 | FP16 | INT8 | Use Case |
|-------|------|------|------|----------|
| YOLOv8n | 45 FPS | 120 FPS | 200 FPS | Object detection |
| YOLOv8s | 25 FPS | 70 FPS | 130 FPS | Object detection |
| MobileNetV3 | 200 FPS | 500 FPS | 800 FPS | Image classification |
| PointPillars | 15 FPS | 35 FPS | 60 FPS | 3D object detection |

Jetson GPU Specifications Comparison

| Feature | Orin Nano | Orin NX | AGX Orin |
|---------|-----------|---------|----------|
| GPU Architecture | Ampere | Ampere | Ampere |
| CUDA Cores | 1024 | 1024 | 2048 |
| Tensor Cores | 32 | 32 | 64 |
| GPU Frequency | 625 MHz | 918 MHz | 1.3 GHz |
| AI Performance (INT8) | 40 TOPS | 100 TOPS | 275 TOPS |
| Memory | 8 GB shared | 8/16 GB shared | 32/64 GB shared |
| Power | 7-15 W | 10-25 W | 15-60 W |

GPU Programming Best Practices

1. Maximize Occupancy

Ensure enough threads to hide memory latency:

\[ \text{Occupancy} = \frac{\text{Active Warps}}{\text{Max Warps per SM}} \]

Target occupancy > 50%.
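
The CUDA runtime can report these numbers directly. The snippet below is a sketch that queries the occupancy of the vectorAdd kernel defined earlier for a given block size (device 0 assumed):

// Query achievable occupancy for a given kernel and block size
int blockSize = 256;
int maxActiveBlocks = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks, vectorAdd, blockSize, 0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
float occupancy = (float)(maxActiveBlocks * blockSize) / prop.maxThreadsPerMultiProcessor;
printf("Occupancy: %.0f%%\n", occupancy * 100.0f);

// Let the runtime suggest a block size that maximizes occupancy
int minGridSize = 0, bestBlockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &bestBlockSize, vectorAdd, 0, 0);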

2. Coalesced Memory Access

Adjacent threads should access adjacent memory addresses, so that the hardware can coalesce a warp's loads and stores into as few memory transactions as possible:

// Good: Coalesced access
C[threadIdx.x] = A[threadIdx.x] + B[threadIdx.x];

// Bad: Strided access
C[threadIdx.x] = A[threadIdx.x * stride];

3. Minimize CPU-GPU Data Transfers

# Bad: Frequent copies
for frame in camera_stream:
    gpu_frame = frame.to('cuda')     # CPU -> GPU
    result = model(gpu_frame)         # GPU inference
    cpu_result = result.to('cpu')     # GPU -> CPU

# Good: Overlap transfer and inference using CUDA streams
copy_stream = torch.cuda.Stream()
for frame in camera_stream:
    with torch.cuda.stream(copy_stream):
        gpu_frame = frame.to('cuda', non_blocking=True)   # async H2D copy on copy_stream
    torch.cuda.current_stream().wait_stream(copy_stream)  # compute waits only for this frame's copy
    result = model(gpu_frame)                             # inference runs on the default stream

4. Leverage Tensor Cores

Use Tensor cores to accelerate matrix operations:

\[ D = A \times B + C \quad \text{(Matrix multiply-add)} \]

Each Tensor core completes a \(4 \times 4\) matrix multiply-accumulate in a single clock cycle.
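
At the CUDA level, Tensor cores are exposed through the WMMA (warp matrix multiply-accumulate) API. The sketch below has a single warp compute one \(16 \times 16\) tile of \(D = A \times B + C\) with FP16 inputs and FP32 accumulation (compute capability 7.0+ assumed; the kernel name is illustrative).

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of D = A*B + C on Tensor Cores
__global__ void wmma_gemm_tile(const half* A, const half* B, const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                 // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(c_frag, C, 16, wmma::mem_row_major);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);        // tile of D = A*B + C
    wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
}

// Launched with exactly one warp, e.g. wmma_gemm_tile<<<1, 32>>>(dA, dB, dC, dD);

In practice, most applications reach Tensor cores indirectly through cuBLAS, cuDNN, TensorRT, or framework-level mixed precision, as shown below.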

Enabling Tensor Cores in PyTorch

# Use FP16 automatic mixed precision
with torch.cuda.amp.autocast():
    output = model(input)

Robot Vision Processing Pipeline

graph LR
    A[Camera<br>CSI/USB] --> B[CPU<br>Image Decoding]
    B --> C[GPU<br>Preprocessing<br>resize/normalize]
    C --> D[GPU<br>Neural Network Inference<br>Detection/Segmentation]
    D --> E[GPU<br>Post-processing<br>NMS/Decoding]
    E --> F[CPU<br>Result Publishing<br>ROS2 Topic]

Using GStreamer + DeepStream enables a full GPU pipeline, avoiding data copies between CPU and GPU.

Summary

  1. GPUs compensate for single-thread performance limitations through massive parallelism
  2. The SIMT model requires avoiding warp divergence
  3. Memory hierarchy and coalesced access are critical for GPU performance
  4. TensorRT can boost inference performance by 3-5x
  5. Minimizing CPU-GPU data transfers is key to optimization
  6. The Jetson series provides AI performance options ranging from 40 to 275 TOPS for robotics
