GPU and Parallel Computing
Overview
The GPU (Graphics Processing Unit) was originally designed for graphics rendering, but its massively parallel architecture has made it the ideal accelerator for deep learning inference and sensor data processing. In robot systems, the GPU handles computationally intensive tasks such as visual perception, point cloud processing, and neural network inference.
GPU Architecture Fundamentals
CPU vs. GPU: Design Philosophy
| Feature | CPU | GPU |
|---|---|---|
| Core Count | 4-16 (large cores) | Hundreds to thousands (small cores) |
| Clock Frequency | 2-5 GHz | 1-2 GHz |
| Cache | Large (tens of MB L3) | Small (a few MB L2) |
| Control Logic | Complex (branch prediction, out-of-order execution) | Simple |
| Suitable Tasks | Complex logic, low latency | Massively parallel, high throughput |
| Design Goal | Minimize latency | Maximize throughput |
CPU: Few large cores, latency-optimized
+----------------+ +----------------+
| Large Core 1 | | Large Core 2 |
| [Control] [ALU]| | [Control] [ALU]|
| [L1 Cache] | | [L1 Cache] |
+----------------+ +----------------+
+--------------------+
| Large L3 Cache |
+--------------------+
GPU: Many small cores, throughput-optimized
+----++----++----++----++----++----++----++----+
|core||core||core||core||core||core||core||core| ...
+----++----++----++----++----++----++----++----+
NVIDIA GPU Architecture
Core organizational structure of NVIDIA GPUs:
Streaming Multiprocessor (SM)
The SM is the fundamental compute unit of a GPU. Each SM contains:
- CUDA cores (FP32): Execute floating-point operations
- Tensor cores: Dedicated matrix operation accelerators
- Shared Memory: Low-latency storage shared among threads within an SM
- Register file: Large register capacity to hold the contexts of many resident threads
- Warp schedulers: Manage warp execution
graph TB
subgraph GPU
subgraph SM_0[SM 0]
A1[CUDA Cores x128]
A2[Tensor Cores x4]
A3[Shared Memory 128KB]
A4[Warp Schedulers x4]
end
subgraph SM_1[SM 1]
B1[CUDA Cores x128]
B2[Tensor Cores x4]
B3[Shared Memory 128KB]
B4[Warp Schedulers x4]
end
SM_N[... SM N]
L2[L2 Cache]
MEM[Global Memory / VRAM]
end
SM_0 --> L2
SM_1 --> L2
SM_N --> L2
L2 --> MEM
SIMT Execution Model
GPUs employ the SIMT (Single Instruction, Multiple Threads) execution model:
- Threads execute in groups called warps (32 threads each)
- All threads in the same warp execute the same instruction
- Each thread can operate on different data (different addresses, different registers)
Warp Divergence
When threads within a warp encounter conditional branches and take different paths, the GPU must serially execute both paths. This severely degrades efficiency:
if (threadIdx.x % 2 == 0) {
    // Even threads execute here
} else {
    // Odd threads execute here
}
// Worst case: performance halved
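When a branch is unavoidable, aligning the condition to warp boundaries keeps every warp on a single path. A minimal sketch that mirrors the even/odd split above (the work inside each branch is left as comments):

// Branch at warp granularity: all 32 threads of a warp take the same path
int warp_id = threadIdx.x / 32;   // warpSize is 32 on current NVIDIA GPUs
if (warp_id % 2 == 0) {
    // Warps 0, 2, 4, ... execute here - no divergence inside a warp
} else {
    // Warps 1, 3, 5, ... execute here
}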
CUDA Programming Fundamentals
Thread Hierarchy
CUDA uses a hierarchical thread organization:
| Level | Description | Hardware Mapping |
|---|---|---|
| Thread | Smallest execution unit | CUDA core |
| Warp | 32 threads | Warp scheduling unit within an SM |
| Block | Multiple warps | Mapped to one SM |
| Grid | Multiple blocks | Entire GPU |
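Inside a kernel, these levels are exposed through built-in index variables. A small illustrative fragment showing how a thread locates itself in the hierarchy (variable names are arbitrary):

// Inside a kernel body: built-in variables give a thread's position in the hierarchy
int local_tid  = threadIdx.x;                            // index within the block
int global_tid = blockIdx.x * blockDim.x + threadIdx.x;  // unique index within the grid
int lane_id    = threadIdx.x % 32;                       // lane within the 32-thread warp
int warp_id    = threadIdx.x / 32;                       // which warp of the block this thread belongs to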
Memory Hierarchy
| Memory Type | Size | Latency | Scope |
|---|---|---|---|
| Registers | ~256KB/SM | 1 cycle | Per thread |
| Shared Memory | 64-228KB/SM | ~5 cycles | Shared within a block |
| L1 Cache | Shared with shared memory | ~30 cycles | SM |
| L2 Cache | Several MB | ~200 cycles | Entire GPU |
| Global Memory (VRAM) | 4-80GB | ~400 cycles | Entire GPU |
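Shared memory is the key to exploiting this hierarchy: data staged there can be reused by all threads of a block at a fraction of global-memory latency. A minimal sketch of a block-level sum reduction, assuming a block size of 256 threads (the kernel and buffer names are illustrative):

// Block-level sum using shared memory (assumes blockDim.x == 256, a power of two)
__global__ void blockSum(const float* in, float* blockSums, int N) {
    __shared__ float tile[256];                       // low-latency, shared within the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (idx < N) ? in[idx] : 0.0f;   // stage data from global memory
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // tree reduction in shared memory
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];              // one global write per block
}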
CUDA Kernel Example
// Vector addition - the most basic CUDA kernel
__global__ void vectorAdd(float* A, float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}
// Launch kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
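For completeness, a sketch of the host-side setup around this launch; error checking is omitted, and h_A, h_B, h_C are assumed host arrays of N floats:

// Host side: allocate device buffers, copy inputs, launch, copy the result back
float *d_A, *d_B, *d_C;
size_t bytes = N * sizeof(float);
cudaMalloc((void**)&d_A, bytes);
cudaMalloc((void**)&d_B, bytes);
cudaMalloc((void**)&d_C, bytes);
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU
cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);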
Image Processing Example
// Image grayscale conversion - robot vision preprocessing
__global__ void rgb2gray(unsigned char* rgb, unsigned char* gray,
                         int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;
        int rgb_idx = idx * 3;
        // ITU-R BT.601 standard luma weights
        gray[idx] = (unsigned char)(0.299f * rgb[rgb_idx] +
                                    0.587f * rgb[rgb_idx + 1] +
                                    0.114f * rgb[rgb_idx + 2]);
    }
}
// 2D grid launch
dim3 block(16, 16);
dim3 grid((width+15)/16, (height+15)/16);
rgb2gray<<<grid, block>>>(d_rgb, d_gray, width, height);
Parallel Speedup Theory
Theoretical Speedup
In the ideal case, the speedup with \(N\) processing units is linear:

\[ S_{\text{ideal}} = \frac{T_1}{T_N} = N \]

Actual speedup is limited by overhead:

\[ S_{\text{actual}} = \frac{T_1}{T_1 / N + T_{\text{overhead}}} \]

Where \(T_{\text{overhead}}\) includes:
- Thread creation and synchronization overhead
- Memory copy overhead (CPU <-> GPU)
- Load imbalance
Gustafson's Law
Unlike Amdahl's Law (fixed problem size), Gustafson's Law considers scaling the problem size as the number of processors increases:

\[ S(N) = N - \alpha (N - 1) \]

Where \(\alpha\) is the fraction of the serial portion. This explains why GPUs are so effective for large-scale data processing: as the workload grows, the parallel portion dominates the runtime.
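For example, with \(N = 1024\) processors and a serial fraction \(\alpha = 0.05\), the scaled speedup is \(S(1024) = 1024 - 0.05 \times 1023 \approx 973\), far beyond the Amdahl limit of \(1/\alpha = 20\) for a fixed-size problem.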
TensorRT Inference Acceleration
TensorRT Optimization Pipeline
TensorRT is NVIDIA's deep learning inference optimizer:
graph LR
A[PyTorch/ONNX Model] --> B[TensorRT Parsing]
B --> C[Layer Fusion]
C --> D[Precision Calibration<br>FP16/INT8]
D --> E[Kernel Auto-Tuning]
E --> F[Optimized Engine<br>TRT Engine]
Key Optimization Techniques
| Technique | Description | Typical Speedup |
|---|---|---|
| Layer Fusion | Merge Conv+BN+ReLU into one kernel | 1.5-2x |
| FP16 Inference | Half-precision floating point | 2-3x |
| INT8 Quantization | 8-bit integer inference | 3-5x |
| Dynamic Batching | Automatically select optimal batch size | 1.2-1.5x |
Inference Performance Examples
Typical inference performance on Jetson Orin NX:
| Model | FP32 | FP16 | INT8 | Use Case |
|---|---|---|---|---|
| YOLOv8n | 45 FPS | 120 FPS | 200 FPS | Object detection |
| YOLOv8s | 25 FPS | 70 FPS | 130 FPS | Object detection |
| MobileNetV3 | 200 FPS | 500 FPS | 800 FPS | Image classification |
| PointPillars | 15 FPS | 35 FPS | 60 FPS | 3D object detection |
Jetson GPU Specifications Comparison
| Feature | Orin Nano | Orin NX | AGX Orin |
|---|---|---|---|
| GPU Architecture | Ampere | Ampere | Ampere |
| CUDA Cores | 1024 | 1024 | 2048 |
| Tensor Cores | 32 | 32 | 64 |
| GPU Frequency | 625 MHz | 918 MHz | 1.3 GHz |
| AI Performance (INT8) | 40 TOPS | 100 TOPS | 275 TOPS |
| Memory | 8GB shared | 8/16GB shared | 32/64GB shared |
| Power | 7-15W | 10-25W | 15-60W |
GPU Programming Best Practices
1. Maximize Occupancy
Ensure enough resident threads to hide memory latency:

\[ \text{Occupancy} = \frac{\text{Active warps per SM}}{\text{Maximum warps per SM}} \]

Target occupancy above 50%.
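One practical way to pick a launch configuration with good occupancy is the CUDA occupancy API; a minimal sketch reusing the earlier vectorAdd kernel:

// Let the runtime suggest an occupancy-friendly block size for vectorAdd
int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vectorAdd, 0, 0);
int blocksPerGrid = (N + blockSize - 1) / blockSize;    // enough blocks to cover all N elements
vectorAdd<<<blocksPerGrid, blockSize>>>(d_A, d_B, d_C, N);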
2. Coalesced Memory Access
Adjacent threads access adjacent memory addresses:
// Good: Coalesced access
C[threadIdx.x] = A[threadIdx.x] + B[threadIdx.x];
// Bad: Strided access
C[threadIdx.x] = A[threadIdx.x * stride];
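A common fix for strided access is switching from an array-of-structures to a structure-of-arrays layout, so that consecutive threads touch consecutive addresses. A minimal sketch (the PointsSoA type and scaleX kernel are illustrative):

// Array-of-structures: pts[i].x is a strided read (3-float stride between threads)
struct PointAoS { float x, y, z; };
// Structure-of-arrays: x[i] puts adjacent threads on adjacent floats
struct PointsSoA { float *x, *y, *z; };

__global__ void scaleX(PointsSoA pts, float s, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) pts.x[i] *= s;   // coalesced: consecutive threads write consecutive addresses
}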
3. Minimize CPU-GPU Data Transfers
# Bad: Frequent copies
for frame in camera_stream:
    gpu_frame = frame.to('cuda')    # CPU -> GPU
    result = model(gpu_frame)       # GPU inference
    cpu_result = result.to('cpu')   # GPU -> CPU
# Good: Pipelined using CUDA streams - overlap copies with inference
copy_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    gpu_frame = frame.pin_memory().to('cuda', non_blocking=True)  # async CPU -> GPU copy
compute_stream.wait_stream(copy_stream)   # inference waits only for this frame's copy
with torch.cuda.stream(compute_stream):
    result = model(gpu_frame)
4. Leverage Tensor Cores
Use Tensor Cores to accelerate matrix operations:

\[ D = A \times B + C \]

where \(A\), \(B\), \(C\), and \(D\) are \(4 \times 4\) matrices; a Tensor Core completes one such matrix multiply-accumulate in a single clock cycle.
Enabling Tensor Cores in PyTorch
# Use FP16 automatic mixed precision
with torch.cuda.amp.autocast():
    output = model(input)
Robot Vision Processing Pipeline
graph LR
A[Camera<br>CSI/USB] --> B[CPU<br>Image Decoding]
B --> C[GPU<br>Preprocessing<br>resize/normalize]
C --> D[GPU<br>Neural Network Inference<br>Detection/Segmentation]
D --> E[GPU<br>Post-processing<br>NMS/Decoding]
E --> F[CPU<br>Result Publishing<br>ROS2 Topic]
Using GStreamer + DeepStream enables a full GPU pipeline, avoiding data copies between CPU and GPU.
Summary
- GPUs compensate for single-thread performance limitations through massive parallelism
- The SIMT model requires avoiding warp divergence
- Memory hierarchy and coalesced access are critical for GPU performance
- TensorRT can boost inference performance by 3-5x
- Minimizing CPU-GPU data transfers is key to optimization
- The Jetson series provides AI performance options ranging from 40 to 275 TOPS for robotics
References
- NVIDIA CUDA Programming Guide: https://docs.nvidia.com/cuda/
- NVIDIA TensorRT Documentation: https://developer.nvidia.com/tensorrt
- NVIDIA Jetson Developer Guide: https://developer.nvidia.com/embedded/
- Kirk, D. B., & Hwu, W. Programming Massively Parallel Processors