GPU and Parallel Computing
Overview
The GPU (Graphics Processing Unit) was originally designed for graphics rendering, but its massively parallel architecture has made it the ideal accelerator for deep learning inference and sensor data processing. In robot systems, the GPU handles computationally intensive tasks such as visual perception, point cloud processing, and neural network inference.
GPU Architecture Fundamentals
CPU vs. GPU: Design Philosophy
| Feature | CPU | GPU |
|---|---|---|
| Core Count | 4-16 (large cores) | Hundreds to thousands (small cores) |
| Clock Frequency | 2-5 GHz | 1-2 GHz |
| Cache | Large (tens of MB L3) | Small (a few MB L2) |
| Control Logic | Complex (branch prediction, out-of-order execution) | Simple |
| Suitable Tasks | Complex logic, low latency | Massively parallel, high throughput |
| Design Goal | Minimize latency | Maximize throughput |
CPU: Few large cores, latency-optimized
+----------------+ +----------------+
| Large Core 1 | | Large Core 2 |
| [Control] [ALU]| | [Control] [ALU]|
| [L1 Cache] | | [L1 Cache] |
+----------------+ +----------------+
+--------------------+
| Large L3 Cache |
+--------------------+
GPU: Many small cores, throughput-optimized
+----++----++----++----++----++----++----++----+
|core||core||core||core||core||core||core||core| ...
+----++----++----++----++----++----++----++----+
NVIDIA GPU Architecture
Core organizational structure of NVIDIA GPUs:
Streaming Multiprocessor (SM)
The SM is the fundamental compute unit of a GPU. Each SM contains:
- CUDA cores (FP32): Execute floating-point operations
- Tensor cores: Dedicated matrix operation accelerators
- Shared Memory: Low-latency storage shared among threads within an SM
- Register file: Large register capacity to hold the contexts of many resident threads
- Warp schedulers: Manage warp execution
graph TB
subgraph GPU
subgraph SM_0[SM 0]
A1[CUDA Cores x128]
A2[Tensor Cores x4]
A3[Shared Memory 128KB]
A4[Warp Schedulers x4]
end
subgraph SM_1[SM 1]
B1[CUDA Cores x128]
B2[Tensor Cores x4]
B3[Shared Memory 128KB]
B4[Warp Schedulers x4]
end
SM_N[... SM N]
L2[L2 Cache]
MEM[Global Memory / VRAM]
end
SM_0 --> L2
SM_1 --> L2
SM_N --> L2
L2 --> MEM
SIMT Execution Model
GPUs employ the SIMT (Single Instruction, Multiple Threads) execution model:
- Threads execute in groups called warps (32 threads each)
- All threads in the same warp execute the same instruction
- Each thread can operate on different data (different addresses, different registers)
Warp Divergence
When threads within a warp encounter conditional branches and take different paths, the GPU must serially execute both paths. This severely degrades efficiency:
if (threadIdx.x % 2 == 0) {
    // Even threads execute here
} else {
    // Odd threads execute here
}
// Worst case: performance halved
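When a branch is unavoidable, aligning the condition to warp boundaries keeps every warp on a single path. A minimal sketch that mirrors the even/odd split above (the work inside each branch is left as comments):

// Branch at warp granularity: all 32 threads of a warp take the same path
int warp_id = threadIdx.x / 32;   // warpSize is 32 on current NVIDIA GPUs
if (warp_id % 2 == 0) {
    // Warps 0, 2, 4, ... execute here - no divergence inside a warp
} else {
    // Warps 1, 3, 5, ... execute here
}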
CUDA Programming Fundamentals
Thread Hierarchy
CUDA uses a hierarchical thread organization:
| Level | Description | Hardware Mapping |
|---|---|---|
| Thread | Smallest execution unit | CUDA core |
| Warp | 32 threads | Warp scheduling unit within an SM |
| Block | Multiple warps | Mapped to one SM |
| Grid | Multiple blocks | Entire GPU |
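Inside a kernel, these levels are exposed through built-in index variables. A small illustrative fragment showing how a thread locates itself in the hierarchy (variable names are arbitrary):

// Inside a kernel body: built-in variables give a thread's position in the hierarchy
int local_tid  = threadIdx.x;                            // index within the block
int global_tid = blockIdx.x * blockDim.x + threadIdx.x;  // unique index within the grid
int lane_id    = threadIdx.x % 32;                       // lane within the 32-thread warp
int warp_id    = threadIdx.x / 32;                       // which warp of the block this thread belongs to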
Memory Hierarchy
| Memory Type | Size | Latency | Scope |
|---|---|---|---|
| Registers | ~256KB/SM | 1 cycle | Per thread |
| Shared Memory | 64-228KB/SM | ~5 cycles | Shared within a block |
| L1 Cache | Shared with shared memory | ~30 cycles | SM |
| L2 Cache | Several MB | ~200 cycles | Entire GPU |
| Global Memory (VRAM) | 4-80GB | ~400 cycles | Entire GPU |
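Shared memory is the key to exploiting this hierarchy: data staged there can be reused by all threads of a block at a fraction of global-memory latency. A minimal sketch of a block-level sum reduction, assuming a block size of 256 threads (the kernel and buffer names are illustrative):

// Block-level sum using shared memory (assumes blockDim.x == 256, a power of two)
__global__ void blockSum(const float* in, float* blockSums, int N) {
    __shared__ float tile[256];                       // low-latency, shared within the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (idx < N) ? in[idx] : 0.0f;   // stage data from global memory
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // tree reduction in shared memory
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];              // one global write per block
}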
CUDA Kernel Example
// Vector addition - the most basic CUDA kernel
__global__ void vectorAdd(float* A, float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}
// Launch kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
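For completeness, a sketch of the host-side setup around this launch; error checking is omitted, and h_A, h_B, h_C are assumed host arrays of N floats:

// Host side: allocate device buffers, copy inputs, launch, copy the result back
float *d_A, *d_B, *d_C;
size_t bytes = N * sizeof(float);
cudaMalloc((void**)&d_A, bytes);
cudaMalloc((void**)&d_B, bytes);
cudaMalloc((void**)&d_C, bytes);
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU
cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);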
Image Processing Example
// Image grayscale conversion - robot vision preprocessing
__global__ void rgb2gray(unsigned char* rgb, unsigned char* gray,
                         int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;
        int rgb_idx = idx * 3;
        // ITU-R BT.601 standard luma weights
        gray[idx] = (unsigned char)(0.299f * rgb[rgb_idx] +
                                    0.587f * rgb[rgb_idx + 1] +
                                    0.114f * rgb[rgb_idx + 2]);
    }
}
// 2D grid launch
dim3 block(16, 16);
dim3 grid((width+15)/16, (height+15)/16);
rgb2gray<<<grid, block>>>(d_rgb, d_gray, width, height);
Parallel Speedup Theory
Theoretical Speedup
In the ideal case, the speedup with \(N\) processing units is linear:

\[ S_{\text{ideal}} = \frac{T_1}{T_N} = N \]

Actual speedup is limited by overhead:

\[ S_{\text{actual}} = \frac{T_1}{T_1 / N + T_{\text{overhead}}} \]

Where \(T_{\text{overhead}}\) includes:
- Thread creation and synchronization overhead
- Memory copy overhead (CPU <-> GPU)
- Load imbalance
Gustafson's Law
Unlike Amdahl's Law (fixed problem size), Gustafson's Law considers scaling the problem size as the number of processors increases:

\[ S(N) = N - \alpha (N - 1) \]

Where \(\alpha\) is the fraction of the serial portion. This explains why GPUs are so effective for large-scale data processing: as the workload grows, the parallel portion dominates the runtime.
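For example, with \(N = 1024\) processors and a serial fraction \(\alpha = 0.05\), the scaled speedup is \(S(1024) = 1024 - 0.05 \times 1023 \approx 973\), far beyond the Amdahl limit of \(1/\alpha = 20\) for a fixed-size problem.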
TensorRT Inference Acceleration
TensorRT Optimization Pipeline
TensorRT is NVIDIA's deep learning inference optimizer:
graph LR
A[PyTorch/ONNX Model] --> B[TensorRT Parsing]
B --> C[Layer Fusion]
C --> D[Precision Calibration<br>FP16/INT8]
D --> E[Kernel Auto-Tuning]
E --> F[Optimized Engine<br>TRT Engine]
Key Optimization Techniques
| Technique | Description | Typical Speedup |
|---|---|---|
| Layer Fusion | Merge Conv+BN+ReLU into one kernel | 1.5-2x |
| FP16 Inference | Half-precision floating point | 2-3x |
| INT8 Quantization | 8-bit integer inference | 3-5x |
| Dynamic Batching | Automatically select optimal batch size | 1.2-1.5x |
Inference Performance Examples
Typical inference performance on Jetson Orin NX:
| Model | FP32 | FP16 | INT8 | Use Case |
|---|---|---|---|---|
| YOLOv8n | 45 FPS | 120 FPS | 200 FPS | Object detection |
| YOLOv8s | 25 FPS | 70 FPS | 130 FPS | Object detection |
| MobileNetV3 | 200 FPS | 500 FPS | 800 FPS | Image classification |
| PointPillars | 15 FPS | 35 FPS | 60 FPS | 3D object detection |
Jetson GPU Specifications Comparison
| Feature | Orin Nano | Orin NX | AGX Orin |
|---|---|---|---|
| GPU Architecture | Ampere | Ampere | Ampere |
| CUDA Cores | 1024 | 1024 | 2048 |
| Tensor Cores | 32 | 32 | 64 |
| GPU Frequency | 625 MHz | 918 MHz | 1.3 GHz |
| AI Performance (INT8) | 40 TOPS | 100 TOPS | 275 TOPS |
| Memory | 8GB shared | 8/16GB shared | 32/64GB shared |
| Power | 7-15W | 10-25W | 15-60W |
GPU Programming Best Practices
1. Maximize Occupancy
Ensure enough resident threads to hide memory latency:

\[ \text{Occupancy} = \frac{\text{Active warps per SM}}{\text{Maximum warps per SM}} \]

Target occupancy above 50%.
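One practical way to pick a launch configuration with good occupancy is the CUDA occupancy API; a minimal sketch reusing the earlier vectorAdd kernel:

// Let the runtime suggest an occupancy-friendly block size for vectorAdd
int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vectorAdd, 0, 0);
int blocksPerGrid = (N + blockSize - 1) / blockSize;    // enough blocks to cover all N elements
vectorAdd<<<blocksPerGrid, blockSize>>>(d_A, d_B, d_C, N);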
2. Coalesced Memory Access
Adjacent threads access adjacent memory addresses:
// Good: Coalesced access
C[threadIdx.x] = A[threadIdx.x] + B[threadIdx.x];
// Bad: Strided access
C[threadIdx.x] = A[threadIdx.x * stride];
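A common fix for strided access is switching from an array-of-structures to a structure-of-arrays layout, so that consecutive threads touch consecutive addresses. A minimal sketch (the PointsSoA type and scaleX kernel are illustrative):

// Array-of-structures: pts[i].x is a strided read (3-float stride between threads)
struct PointAoS { float x, y, z; };
// Structure-of-arrays: x[i] puts adjacent threads on adjacent floats
struct PointsSoA { float *x, *y, *z; };

__global__ void scaleX(PointsSoA pts, float s, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) pts.x[i] *= s;   // coalesced: consecutive threads write consecutive addresses
}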
3. Minimize CPU-GPU Data Transfers
# Bad: Frequent copies
for frame in camera_stream:
    gpu_frame = frame.to('cuda')    # CPU -> GPU
    result = model(gpu_frame)       # GPU inference
    cpu_result = result.to('cpu')   # GPU -> CPU
# Good: Pipelined using CUDA streams - overlap copies with inference
copy_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    gpu_frame = frame.pin_memory().to('cuda', non_blocking=True)  # async CPU -> GPU copy
compute_stream.wait_stream(copy_stream)   # inference waits only for this frame's copy
with torch.cuda.stream(compute_stream):
    result = model(gpu_frame)
4. Leverage Tensor Cores
Use Tensor Cores to accelerate matrix operations:

\[ D = A \times B + C \]

where \(A\), \(B\), \(C\), and \(D\) are \(4 \times 4\) matrices; a Tensor Core completes one such matrix multiply-accumulate in a single clock cycle.
Enabling Tensor Cores in PyTorch
# Use FP16 automatic mixed precision
with torch.cuda.amp.autocast():
    output = model(input)
Robot Vision Processing Pipeline
graph LR
A[Camera<br>CSI/USB] --> B[CPU<br>Image Decoding]
B --> C[GPU<br>Preprocessing<br>resize/normalize]
C --> D[GPU<br>Neural Network Inference<br>Detection/Segmentation]
D --> E[GPU<br>Post-processing<br>NMS/Decoding]
E --> F[CPU<br>Result Publishing<br>ROS2 Topic]
Using GStreamer + DeepStream enables a full GPU pipeline, avoiding data copies between CPU and GPU.
Summary
- GPUs compensate for single-thread performance limitations through massive parallelism
- The SIMT model requires avoiding warp divergence
- Memory hierarchy and coalesced access are critical for GPU performance
- TensorRT can boost inference performance by 3-5x
- Minimizing CPU-GPU data transfers is key to optimization
- The Jetson series provides AI performance options ranging from 40 to 275 TOPS for robotics
References
- NVIDIA CUDA Programming Guide: https://docs.nvidia.com/cuda/
- NVIDIA TensorRT Documentation: https://developer.nvidia.com/tensorrt
- NVIDIA Jetson Developer Guide: https://developer.nvidia.com/embedded/
- Kirk, D. B., & Hwu, W. Programming Massively Parallel Processors