Memory and Storage
Overview
Memory and storage systems form the data infrastructure of robot computing platforms. From high-speed caches to persistent storage, different tiers of storage media achieve a balance among capacity, speed, and cost. Understanding the memory hierarchy is essential for optimizing data throughput in robot systems.
SRAM vs. DRAM
SRAM (Static Random-Access Memory)
SRAM uses six transistors (6T) to construct a single storage cell:
| Feature | Value |
|---|---|
| Transistors/bit | 6 |
| Access Speed | 1-2ns |
| Requires Refresh | No |
| Power Consumption | Low (static) |
| Density | Low |
| Cost | High |
| Use | CPU caches (L1/L2/L3) |
DRAM (Dynamic Random-Access Memory)
DRAM uses one transistor + one capacitor (1T1C):
| Feature | Value |
|---|---|
| Transistors/bit | 1 + 1 capacitor |
| Access Speed | 50-100ns |
| Requires Refresh | Yes (~every 64ms) |
| Power Consumption | Higher (refresh) |
| Density | High |
| Cost | Low |
| Use | Main memory (RAM) |
DRAM Type Evolution
| Type | Bandwidth | Voltage | Features | Application |
|---|---|---|---|---|
| DDR4 | 25.6 GB/s | 1.2V | Mature and stable | PCs, servers |
| DDR5 | 51.2 GB/s | 1.1V | Higher bandwidth | Next-gen PCs |
| LPDDR4X | 34.1 GB/s | 0.6V | Low power | Jetson, smartphones |
| LPDDR5 | 51.2 GB/s | 0.5V | Low power + high bandwidth | Jetson Orin |
| LPDDR5X | 67.2 GB/s | 0.5V | Latest | High-end mobile SoCs |
Memory Selection for Robot Platforms
- Jetson Orin: LPDDR5, unified memory (shared by CPU and GPU)
- Raspberry Pi 5: LPDDR4X, 4/8GB
- STM32H7: Built-in 1MB SRAM, no external DRAM needed
Cache Coherence
Multi-Core Caching Problem
When multiple CPU cores each have their own L1 cache, the same data may exist in multiple copies:
```
  Core 0           Core 1
+--------+       +--------+
|L1: X=5 |       |L1: X=5 |
+----+---+       +----+---+
     |     +----------+  |
     +-----| L2: X=5  |--+
           +----+-----+
                |
           +----+-----+
           | DRAM: X=5|
           +----------+
```
Core 0 writes X=10, but X in Core 1's cache is still 5 -> Inconsistency!
MESI Protocol
MESI is the most commonly used cache coherence protocol. Each cache line has four states:
| State | Description |
|---|---|
| Modified (M) | Valid only in this core; modified, inconsistent with main memory |
| Exclusive (E) | Valid only in this core; consistent with main memory |
| Shared (S) | Valid in multiple cores; consistent with main memory |
| Invalid (I) | Cache line is invalid |
```mermaid
graph TD
    I[Invalid] -->|"Local read (Bus Read)"| S[Shared]
    I -->|"Local read (no other copy)"| E[Exclusive]
    E -->|"Local write"| M[Modified]
    S -->|"Local write (Invalidate)"| M
    M -->|"Remote read (Write Back)"| S
    E -->|"Remote read"| S
    S -->|"Remote write (Invalidate)"| I
    M -->|"Remote write (Write Back)"| I
```
Impact on Robot Software
In ROS2 multi-threaded nodes, when multiple threads share data, note that:
- Frequently writing to shared variables causes cache lines to bounce between cores
- Use `std::atomic` or locks to ensure correctness
- False sharing: different variables that happen to share one cache line (typically 64 bytes) cause one thread's write to invalidate the other thread's cached copy
Memory Hierarchy
Complete Storage Hierarchy

| Tier | Typical Capacity | Typical Access Time |
|---|---|---|
| Registers | ~KB | <1 ns |
| L1 cache (SRAM) | 32-128 KB | 1-2 ns |
| L2 cache (SRAM) | 256 KB-1 MB | ~5 ns |
| L3 cache (SRAM) | 4-32 MB | ~20 ns |
| Main memory (DRAM) | 4-64 GB | 50-100 ns |
| NVMe SSD | 256 GB-4 TB | ~100 µs |
| eMMC / SD card | 16-256 GB | ~1 ms |
Average Access Time
The Average Memory Access Time (AMAT) for a two-level cache backed by DRAM:

\[
\text{AMAT} = T_{L1} + m_{L1}\left(T_{L2} + m_{L2}\,T_{\text{DRAM}}\right)
\]

Where \(m_i\) is the miss rate of the \(i\)-th cache level.
Calculation Example
Assume \(T_{L1}=1\text{ns}\), \(m_{L1}=5\%\), \(T_{L2}=5\text{ns}\), \(m_{L2}=20\%\), \(T_{\text{DRAM}}=50\text{ns}\):

\[
\text{AMAT} = 1 + 0.05\times(5 + 0.2\times 50) = 1 + 0.05\times 15 = 1.75\text{ns}
\]

Even though DRAM is 50x slower than L1, high hit rates keep the average access time close to L1 speed.
Principle of Locality
The theoretical foundation for cache effectiveness:
- Temporal Locality: Recently accessed data is likely to be accessed again soon
- Spatial Locality: After accessing a certain address, nearby addresses are also likely to be accessed
```cpp
// Good spatial locality (row-major traversal)
for (int i = 0; i < rows; i++)
    for (int j = 0; j < cols; j++)
        process(image[i][j]);  // Contiguous memory access

// Poor spatial locality (column-major traversal)
for (int j = 0; j < cols; j++)
    for (int i = 0; i < rows; i++)
        process(image[i][j]);  // Strided memory access
```
DMA (Direct Memory Access)
DMA Principles
A DMA controller allows peripherals to exchange data directly with memory without CPU involvement:
```mermaid
graph LR
    A[Sensor/Peripheral] -->|DMA Transfer| B[Memory DRAM]
    C[CPU] -.->|Configure DMA| D[DMA Controller]
    D -->|Manage Transfer| A
    D -->|Manage Transfer| B
    D -->|Transfer Complete Interrupt| C
```
DMA Applications in Robotics
| Scenario | Data Source | Destination | Data Volume | Description |
|---|---|---|---|---|
| Camera Capture | CSI Interface | Frame Buffer | ~6MB/frame (1080p) | ISP+DMA pipeline |
| LiDAR Data | SPI/Ethernet | Point Cloud Buffer | ~100KB/frame | High-frequency transfer |
| ADC Sampling | ADC Peripheral | Ring Buffer | ~Several KB/batch | STM32 DMA+ADC |
| Audio | I2S Interface | Audio Buffer | ~192KB/s | Double buffering |
STM32 DMA Configuration Example
```c
// STM32 HAL: ADC continuous acquisition via DMA
#define BUFFER_SIZE 256
static uint16_t adc_buffer[BUFFER_SIZE];

HAL_ADC_Start_DMA(&hadc1, (uint32_t*)adc_buffer, BUFFER_SIZE);

// DMA transfer complete callback
void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef* hadc) {
    // Process the acquired data
    process_sensor_data(adc_buffer, BUFFER_SIZE);
}
```
Storage Media Comparison
eMMC vs. NVMe
| Feature | eMMC 5.1 | NVMe SSD |
|---|---|---|
| Interface | Parallel MMC | PCIe Gen3/4 |
| Sequential Read | ~400 MB/s | ~3500 MB/s |
| Sequential Write | ~200 MB/s | ~3000 MB/s |
| Random Read (IOPS) | ~15K | ~500K |
| Power | ~0.5W | ~3-5W |
| Price (64GB) | ~$10 | ~$30 |
| Use | Embedded systems | High-performance needs |
Robot Storage Requirements
| Data Type | Data Rate | Storage Needed (1 hour) | Recommended Storage |
|---|---|---|---|
| Log files | ~1 MB/s | ~3.6 GB | eMMC |
| 720p video | ~5 MB/s | ~18 GB | NVMe |
| 1080p video | ~15 MB/s | ~54 GB | NVMe |
| Point cloud data | ~10 MB/s | ~36 GB | NVMe |
| rosbag recording | ~50 MB/s | ~180 GB | NVMe |
Memory-Mapped I/O (MMIO)
Principles
In ARM architecture, peripheral registers are mapped to the memory address space. The CPU controls hardware by reading from and writing to specific addresses:
```
Memory Address Space:
0x0000_0000 - 0x1FFF_FFFF : Flash (program code)
0x2000_0000 - 0x3FFF_FFFF : SRAM (data)
0x4000_0000 - 0x5FFF_FFFF : Peripheral registers (MMIO)
    0x4000_0000 : TIM2 (timer)
    0x4000_1000 : TIM3
    0x4000_4400 : USART2
    0x4001_0000 : GPIOA
    ...
0xE000_0000 - 0xFFFF_FFFF : System peripherals (NVIC, SysTick)
```
Accessing Hardware in Linux
```c
// Accessing physical addresses via mmap in Linux
#include <sys/mman.h>
#include <fcntl.h>
#include <stdint.h>

int fd = open("/dev/mem", O_RDWR | O_SYNC);
volatile uint32_t* gpio = (volatile uint32_t*)mmap(
    NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
    fd, GPIO_BASE_ADDR
);
if (gpio == MAP_FAILED) { /* handle mapping failure */ }

// Set GPIO output
gpio[GPIO_SET_OFFSET/4] = (1 << pin);
```
Security Considerations
Directly accessing /dev/mem requires root privileges and poses security risks. In production systems, hardware should be accessed through device drivers or the sysfs interface.
Jetson Unified Memory Architecture
The Jetson platform employs a Unified Memory Architecture:
```
+-----------------------------+
|        LPDDR5 Memory        |
|   (Shared by CPU and GPU,   |
|       no copy needed)       |
+--------------+--------------+
|   CPU View   |   GPU View   |
|   (cache     |   (cache     |
|   coherent)  |   coherent)  |
+--------------+--------------+
```
Advantages:
- Zero-copy: CPU and GPU can directly access the same memory
- Simplified programming: no explicit `cudaMemcpy` needed
- Reduced latency: avoids data transfer overhead
```python
# PyTorch on Jetson: Using unified memory
import torch
from torch.utils.data import DataLoader

# Create tensors directly on CUDA; CPU can also access them
tensor = torch.zeros(1000, 1000, device='cuda')

# Use pin_memory to accelerate host-to-device transfers
dataloader = DataLoader(dataset, pin_memory=True)
```
Memory Optimization Strategies
Robot System Memory Budget
| Component | Memory Usage | Description |
|---|---|---|
| Linux kernel + system services | ~500MB | Base overhead |
| ROS2 runtime | ~200MB | DDS + node management |
| Deep learning models | 500MB-4GB | Depends on model size |
| Camera frame buffers | ~100MB | Multi-frame buffering |
| Point cloud buffers | ~200MB | LiDAR data |
| Navigation maps | 100MB-1GB | Depends on map size |
| Total | ~2-6GB | 8GB is usually sufficient |
Memory Leak Detection
```bash
# Monitor process memory usage
watch -n 1 'ps aux --sort=-rss | head -10'

# Use valgrind for C++ memory leak detection
valgrind --leak-check=full ./robot_node

# Python memory profiling
python -m memory_profiler robot_script.py
```
Summary
- SRAM is fast but expensive for caches; DRAM is slow but large for main memory
- Cache coherence protocols (MESI) ensure multi-core data consistency
- DMA frees the CPU by letting sensor data write directly to memory
- NVMe SSDs are essential for data-intensive robots (e.g., autonomous driving)
- Jetson unified memory simplifies CPU-GPU data sharing
- Proper memory budgeting ensures stable system operation
References
- Patterson, D. A., & Hennessy, J. L. Computer Organization and Design
- STM32 Reference Manual (DMA chapter)
- NVIDIA Jetson Memory Architecture Guide
- Linux Kernel Documentation: Memory Management