Robot systems span multiple computational tiers: from microsecond-level real-time control to second-level AI inference. This article surveys embedded AI computing platforms, edge inference accelerators, cloud training resources, and heterogeneous computing architectures.
Computational Requirements by Tier
```mermaid
graph LR
    subgraph RT["Real-time Control Layer"]
        MCU[MCU / FPGA<br/>1kHz - 10kHz<br/>Motor control, sensor readout]
    end
    subgraph Edge["Edge Inference Layer"]
        JETSON[Jetson / Edge AI<br/>30-100Hz<br/>Perception, planning, inference]
    end
    subgraph Cloud["Cloud Training Layer"]
        GPU[GPU Cluster<br/>A100 / H100<br/>Model training, large-scale simulation]
    end
    MCU -- "EtherCAT / CAN<br/>< 1ms" --> JETSON
    JETSON -- "WiFi / 5G<br/>10-100ms" --> GPU
    style RT fill:#ffebee
    style Edge fill:#e8f5e9
    style Cloud fill:#e3f2fd
```
| Tier | Latency Requirement | Computation Type | Typical Hardware |
|---|---|---|---|
| Real-time control | <1 ms (>1 kHz) | Fixed algorithms, PID, state machines | MCU (STM32), FPGA |
| Perception inference | 10-100 ms (10-100 Hz) | Neural network inference, SLAM | Jetson, edge AI |
| High-level planning | 100 ms-1 s | Motion planning, task planning | Jetson AGX, industrial PC |
| Model training | Hours to days | Large-scale RL/VLA training | GPU cluster |
NVIDIA Jetson Series
Jetson is the de facto standard for robot AI computing. The current main product line is the Orin series.
Orin Series Comparison
| Model | GPU | CPU | AI Performance | Memory | Storage Interface | Power | Ref. Price |
|---|---|---|---|---|---|---|---|
| Orin Nano 4GB | 512 CUDA | 6-core A78AE | 20 TOPS | 4GB LPDDR5 | NVMe | 7-15W | ~$199 |
| Orin Nano 8GB | 1024 CUDA | 6-core A78AE | 40 TOPS | 8GB LPDDR5 | NVMe | 7-15W | ~$299 |
| Orin NX 8GB | 1024 CUDA | 6-core A78AE | 70 TOPS | 8GB LPDDR5 | NVMe | 10-25W | ~$399 |
| Orin NX 16GB | 1024 CUDA | 8-core A78AE | 100 TOPS | 16GB LPDDR5 | NVMe | 10-25W | ~$599 |
| AGX Orin 32GB | 1792 CUDA | 8-core A78AE + 4-core A78 | 200 TOPS | 32GB LPDDR5 | NVMe | 15-50W | ~$999 |
| AGX Orin 64GB | 2048 CUDA | 12-core A78AE | 275 TOPS | 64GB LPDDR5 | NVMe | 15-60W | ~$1,599 |
Jetson Nano (Legacy)
The original Jetson Nano (128 CUDA Maxwell, 472 GFLOPS FP16, 4GB LPDDR4, 5-10W, ~$149) is gradually being replaced by the Orin Nano, but remains widely used in educational settings.
JetPack SDK
JetPack is the complete SDK for Jetson, including:
| Component | Description |
|---|---|
| L4T | Linux for Tegra (Ubuntu-based OS) |
| CUDA | GPU computing |
| cuDNN | Deep learning acceleration |
| TensorRT | Inference optimization engine (FP16/INT8 quantization, layer fusion) |
| VPI | Vision Programming Interface |
| Multimedia API | Hardware video encode/decode |
| DeepStream | Video analytics pipeline |
| Isaac ROS | Robot-specific ROS 2 acceleration packages |
Version Mapping
| JetPack | L4T | CUDA | Supported Hardware |
|---|---|---|---|
| 5.1.x | R35.x | 11.4 | All Orin series |
| 6.0+ | R36.x | 12.2+ | All Orin series |
Deployment Optimization
```bash
# Set high-performance mode
sudo nvpmodel -m 0    # MAXN mode (maximum performance)
sudo jetson_clocks    # Lock clocks at maximum frequency

# TensorRT model optimization: FP16 quantization, 4 GB builder workspace
trtexec --onnx=model.onnx \
        --saveEngine=model.engine \
        --fp16 \
        --workspace=4096
```

```python
# Python TensorRT inference example
import tensorrt as trt
import pycuda.autoinit  # noqa: F401 -- creates the CUDA context
import pycuda.driver as cuda

# Load the serialized engine
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Bind inputs/outputs, allocate device memory, execute inference
# Typical latency: ~5ms on Orin NX (INT8), ~20ms on Orin Nano (FP16)
```
Jetson Selection Recommendations
| Application | Recommended Model | Rationale |
|---|---|---|
| Education/entry-level | Orin Nano 8GB | Low cost, sufficient performance |
| Service robot (navigation + avoidance) | Orin NX 16GB | Balanced performance and power |
| Robot arm manipulation (VLA inference) | AGX Orin 32GB | Large models require large memory |
| Humanoid robot | AGX Orin 64GB | Multi-modal perception + whole-body control |
| Autonomous driving prototype | AGX Orin 64GB | Multi-sensor fusion |
Next Generation: Jetson Thor
Jetson Thor is NVIDIA's announced next-generation robot computing platform, built on the Blackwell GPU architecture:
| Metric | AGX Orin | Thor (Expected) |
|---|---|---|
| AI Performance | 275 TOPS | 800+ TOPS |
| Memory | 64GB | 128GB |
| GPU Architecture | Ampere | Blackwell |
| Target Application | General robotics | Humanoid robot foundation models |
Intel Movidius (Integrated into OpenVINO)
| Feature | Description |
|---|---|
| Chip | Myriad X VPU |
| Performance | ~4 TOPS |
| Power | ~1W |
| Features | Ultra-low power, USB accelerator stick form factor |
| SDK | Intel OpenVINO |
| Status | Standalone chips no longer produced; integrated into Intel platforms |
Google Coral TPU
| Feature | Description |
|---|---|
| Chip | Edge TPU |
| Performance | 4 TOPS (INT8) |
| Power | ~2W |
| Form Factor | USB accelerator / Dev Board / M.2 module |
| SDK | TensorFlow Lite |
| Features | INT8-only, extremely low inference latency |
Edge Accelerator Comparison
| Platform | Performance | Power | Ecosystem | Flexibility | Price |
|---|---|---|---|---|---|
| Jetson Orin Nano | 40 TOPS | 15W | CUDA/TensorRT | Very high | $299 |
| Coral TPU | 4 TOPS | 2W | TF Lite | Low | $60 |
| OpenVINO (Intel) | ~5 TOPS | 5W | OpenVINO | Medium | $80 |
| Hailo-8 | 26 TOPS | 3W | Hailo SDK | Medium | ~$100 |
| Rockchip RK3588 | 6 TOPS | 5-10W | RKNN | Medium | ~$100 |
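Reading the table by performance per watt makes the tradeoffs clearer. A rough sketch using the table's figures (vendor peak TOPS at differing precisions, and the upper power figures, so treat the result as indicative only):

```python
# Edge accelerator efficiency from the comparison table (TOPS / W).
# Figures are vendor peak numbers at different precisions, taken at the
# upper end of each power range -- a rough comparison, not a benchmark.
platforms = {
    "Jetson Orin Nano": (40, 15),
    "Coral TPU": (4, 2),
    "OpenVINO (Intel)": (5, 5),
    "Hailo-8": (26, 3),
    "Rockchip RK3588": (6, 10),
}

efficiency = {name: tops / watts for name, (tops, watts) in platforms.items()}
ranking = sorted(efficiency, key=efficiency.get, reverse=True)
print(ranking[0], round(efficiency[ranking[0]], 2))  # Hailo-8 8.67
```

Dedicated accelerators like Hailo-8 lead on raw TOPS/W, but the Jetson's CUDA flexibility is often worth the efficiency gap when the workload is not a fixed, pre-compiled network.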
Onboard vs Cloud Computing
In robot systems, different tasks have different latency requirements, necessitating proper partitioning of local and cloud computation.
Latency Requirements and Compute Location
| Control Tier | Frequency Req. | Latency Tolerance | Compute Location | Example |
|---|---|---|---|---|
| Low-level motor control | 1-10 kHz | <1 ms | Local (FPGA/MCU) | PID torque control |
| Mid-level motion control | 100-500 Hz | 2-10 ms | Local (Jetson/MCU) | Trajectory tracking |
| High-level policy inference | 10-50 Hz | 20-100 ms | Local (Jetson) | Visual policy inference |
| Language understanding/planning | 0.1-1 Hz | 100 ms-seconds | Cloud or local | VLM task planning |
| Training/fine-tuning | Offline | Unconstrained | Cloud | Policy model training |
Key Principle: 1 kHz-level control loops must run locally and never depend on the network; 10 Hz-level high-level planning can offload to the cloud, but always needs a local fallback mechanism.
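The fallback requirement can be sketched concretely: query the cloud planner with a hard deadline and fall back to an on-board lightweight policy when the deadline is missed. All function names below are illustrative, not from any specific framework; the network call is simulated with a sleep:

```python
import concurrent.futures
import time

def cloud_plan(goal):
    """Stand-in for a cloud VLM planning call (may be slow or unreachable)."""
    time.sleep(0.5)  # simulate network round-trip + inference latency
    return f"cloud plan for {goal}"

def local_plan(goal):
    """Lightweight on-board fallback policy -- always available."""
    return f"local plan for {goal}"

def plan_with_fallback(goal, deadline_s=0.1):
    """Try the cloud planner, but never block the robot past the deadline."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(cloud_plan, goal)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        return local_plan(goal)
    finally:
        pool.shutdown(wait=False)  # don't block the control path on the slow call

print(plan_with_fallback("fetch cup"))  # -> local plan for fetch cup
```

The same pattern applies whatever the transport: the local tier owns the deadline, and the cloud result is an optional upgrade rather than a dependency.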
Cloud / Workstation GPUs
Training Card Comparison
| GPU | VRAM | FP16 TFLOPS | Interconnect | Price | Typical Use |
|---|---|---|---|---|---|
| RTX 4090 | 24GB | 330 | PCIe | ~$1,600 | Personal research |
| A100 80GB | 80GB | 312 | NVLink | ~$15,000 | Lab training |
| H100 80GB | 80GB | 990 | NVLink/NVSwitch | ~$30,000 | Large-scale training |
| H200 | 141GB HBM3e | 990 | NVLink | ~$35,000 | VLA large models |
GPU Requirements for Robot AI
| Task | Model Scale | Minimum GPU | Recommended GPU |
|---|---|---|---|
| Small RL policy training | <10M params | RTX 3060 | RTX 4090 |
| Isaac Lab parallel training | — | RTX 3080 | A100 |
| VLA fine-tuning (7B) | 7B params | A100 40GB | 2x A100 80GB |
| VLA pretraining | 7B+ params | 8x A100 | 8x H100 |
| Real-time VLA inference | 3B params | Jetson AGX Orin | — |
| Real-time small model inference | <100M params | Jetson Orin NX | — |
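The fine-tuning rows follow from a common memory rule of thumb: full fine-tuning with Adam in mixed precision costs roughly 16 bytes per parameter, before activations. A quick sanity check (the 16 B/param breakdown is an approximation, not an exact figure):

```python
def full_finetune_vram_gb(params_billions):
    """Rough mixed-precision training footprint, excluding activations:
    2 B (fp16 weights) + 4 B (fp32 master weights)
    + 8 B (Adam moments, fp32) + 2 B (fp16 gradients) = 16 B/param."""
    return params_billions * 16

print(full_finetune_vram_gb(7))  # 112
```

112 GB for a 7B model, before activation memory, is why the table recommends 2x A100 80GB; parameter-efficient methods (LoRA, gradient checkpointing) can bring this down substantially.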
Inference Optimization Pipeline
Common optimization techniques for deploying models at the edge:
- Quantization: FP32 -> FP16 -> INT8, reducing computational requirements 2-8x
- Distillation: Transfer large model knowledge to smaller models
- Pruning: Remove redundant weights
- TensorRT optimization: Layer fusion, memory optimization, 2-5x inference speedup
```
Training (A100/H100, FP32)
    |  Export ONNX
    v
TensorRT conversion (FP16/INT8)
    |  Deploy to Jetson
    v
Inference (Jetson Orin, INT8): 50ms -> 8ms, 400MB -> 100MB
```
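To make the quantization step concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization, the basic idea behind TensorRT's INT8 mode (real deployments add calibration data and per-channel scales):

```python
def quantize_int8(values):
    """Map floats to int8 with a single symmetric scale: q = round(x / s)."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.5, -1.0, 0.25, 0.1]
q, s = quantize_int8(weights)
restored = dequantize(q, s)
# Each value is recovered to within one quantization step (the scale);
# the storage cost drops from 4 bytes to 1 byte per value.
```

The 4x storage reduction matches the 400MB -> 100MB figure above; the latency gain comes from INT8 tensor-core throughput rather than the smaller memory footprint alone.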
FPGA in Robotics
FPGAs are used for real-time control scenarios requiring microsecond-level deterministic latency.
Typical Applications
| Application | Description | Latency Requirement |
|---|---|---|
| Motor FOC control | Field-oriented control, PWM generation | <10 us |
| EtherCAT master | Real-time industrial communication | <1 ms |
| Sensor preprocessing | Encoder counting, ADC sampling | <1 us |
| Safety monitoring | Force/position limits, emergency stop | <10 us |
| Platform | Chip | Features | Price | Application |
|---|---|---|---|---|
| Xilinx Zynq-7000 | ARM + FPGA | SoC, embedded + logic | ~$200 | Motor control |
| Intel Cyclone V | ARM + FPGA | Low-cost SoC | ~$150 | Education/prototype |
| Xilinx Kria KV260 | Zynq UltraScale+ | Vision AI + real-time control | ~$250 | Robot vision |
| Lattice iCE40 | — | Ultra-low power, open-source toolchain | ~$50 | Simple control logic |
FPGA vs MCU Comparison
| Feature | FPGA | MCU (STM32, etc.) |
|---|---|---|
| Latency | <1 us | 1-100 us |
| Parallelism | True hardware parallelism | Pseudo-parallelism (interrupts) |
| Development difficulty | High (HDL/Verilog) | Low (C/C++) |
| Flexibility | Hardware reconfigurable | Fixed architecture |
| Cost | Higher | Low |
| Typical scenario | Multi-axis synchronous control | Single-axis PID control |
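For contrast with the FPGA column, the MCU side typically runs a fixed-timestep discrete PID loop. A minimal sketch against a toy first-order plant (the gains and plant model are illustrative, not tuned for any real motor):

```python
class PID:
    """Discrete PID controller at a fixed sample time dt."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt          # rectangular integration
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Simulate a 1 kHz control loop on a first-order plant: x' = -x + u
pid = PID(kp=5.0, ki=5.0, kd=0.0, dt=0.001)
x = 0.0
for _ in range(5000):                # 5 seconds of simulated time
    u = pid.update(1.0, x)           # command toward setpoint 1.0
    x += (-x + u) * 0.001            # forward-Euler plant step

print(round(x, 3))  # 1.0
```

On a real MCU the same `update` runs in a timer interrupt; the entire loop is a few dozen fixed-point operations, which is why a 1-10 kHz rate is comfortably within STM32-class hardware.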
Power Budget for Mobile Robots
Mobile robots have limited battery capacity; computing platform power directly affects battery life.
Typical Power Distribution (Mobile Manipulation Robot)
| Subsystem | Power Share | Typical Power |
|---|---|---|
| Mobile base motors | 40-50% | 50-200 W |
| Robot arm motors | 20-30% | 20-100 W |
| Computing platform | 10-20% | 10-60 W |
| Sensors | 5-10% | 5-20 W |
| Communication | 2-5% | 2-10 W |
Power Optimization Strategies
- Dynamic frequency scaling: Lower GPU/CPU frequency when idle (use `nvpmodel` to switch power modes)
- On-demand model loading: Switch to lightweight models when complex inference is unnecessary
- Sensor sleep: Non-essential sensors can sample intermittently
- Mixed-precision inference: Use lower precision for non-critical tasks (INT8 reduces power ~40% vs FP16)
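The budget translates directly into runtime: usable energy divided by total draw. A quick sketch with mid-range numbers from the table (the 0.9 conversion-efficiency factor and the 500 Wh pack are assumptions for illustration):

```python
def runtime_hours(capacity_wh, loads_w, efficiency=0.9):
    """Estimated runtime: usable battery energy divided by total power draw."""
    return capacity_wh * efficiency / sum(loads_w.values())

# Mid-range draws from the power distribution table (watts)
loads = {"base": 120, "arm": 50, "compute": 30, "sensors": 10, "comms": 5}
print(round(runtime_hours(500, loads), 2))  # 2.09
```

Note how little the compute platform moves the total: cutting it from 30 W to 15 W buys only a few extra minutes here, so power optimization matters most on small platforms where motors draw less.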
Heterogeneous Computing Architecture
Real robot systems typically employ heterogeneous computing architectures with multi-tier hardware collaboration:
| Tier | Hardware | Communication | Function |
|---|---|---|---|
| Level 0 | MCU (STM32H7) | CAN/SPI | Motor control (10kHz) |
| Level 1 | FPGA (optional) | EtherCAT | Real-time safety monitoring |
| Level 2 | Jetson (Orin) | Ethernet/USB3 | AI inference + ROS2 |
| Level 3 | Cloud GPU | WiFi/5G | Training, remote monitoring |
```
+------------------------------------------+
|          Cloud (GPU Cluster)             |
|      Train VLA / Large-scale sim         |
+-----------------+------------------------+
                  | WiFi / 5G
+-----------------+------------------------+
|        Jetson AGX Orin (ROS2)            |
|  Perception | SLAM | Planning | VLA      |
+----+----------+----------+---------------+
     | USB3     | Ethernet | EtherCAT
+----+----+ +---+---+ +---+--------------+
| Camera  | | LiDAR | | MCU (STM32)      |
| D435i   | | Mid360| | Motor FOC 10kHz  |
+---------+ +-------+ | Encoder readout  |
                      | Safety limits    |
                      +------------------+
```
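A common pattern on the Jetson tier is multi-rate scheduling: the fast control callback runs every tick while slower inference fires on its own period. A simulated-time sketch of the rate bookkeeping (no real scheduler or ROS2 executor involved):

```python
# Multi-rate loop bookkeeping: 1 kHz control, 10 Hz inference,
# simulated over one second of virtual time.
CONTROL_HZ = 1000
INFERENCE_HZ = 10
DT = 1.0 / CONTROL_HZ

control_ticks = 0
inference_ticks = 0
next_inference_t = 0.0
t = 0.0

for _ in range(CONTROL_HZ):          # one second of ticks
    control_ticks += 1               # 1 kHz: runs every tick
    if t >= next_inference_t:        # 10 Hz: runs when its period elapses
        inference_ticks += 1
        next_inference_t += 1.0 / INFERENCE_HZ
    t += DT

print(control_ticks, inference_ticks)  # 1000 10
```

In a real system the two rates live on different hardware (MCU vs Jetson) precisely so that a slow inference step can never delay a control tick; the sketch only illustrates the 100:1 rate partitioning.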
Selection Decision Process
- Determine inference model size: Parameter count determines minimum memory requirements
- Determine inference frequency: Control >100Hz requires high compute; perception at 30Hz is less demanding
- Power budget: Mobile robots are strictly constrained; fixed installations are more relaxed
- ROS2 support: Jetson ecosystem is most comprehensive
- Cost constraints: Orin Nano for education, AGX Orin for research
For a more detailed selection framework, see Hardware Selection Guide.