Memory and Storage

Overview

Memory and storage systems form the data infrastructure of robot computing platforms. From high-speed caches to persistent storage, different tiers of storage media achieve a balance among capacity, speed, and cost. Understanding the memory hierarchy is essential for optimizing data throughput in robot systems.

SRAM vs. DRAM

SRAM (Static Random-Access Memory)

SRAM uses six transistors (6T) to construct a single storage cell:

| Feature | Value |
| --- | --- |
| Transistors/bit | 6 |
| Access Speed | 1-2 ns |
| Requires Refresh | No |
| Power Consumption | Low (static) |
| Density | Low |
| Cost | High |
| Use | CPU caches (L1/L2/L3) |

DRAM (Dynamic Random-Access Memory)

DRAM uses one transistor + one capacitor (1T1C):

| Feature | Value |
| --- | --- |
| Transistors/bit | 1 + 1 capacitor |
| Access Speed | 50-100 ns |
| Requires Refresh | Yes (~every 64 ms) |
| Power Consumption | Higher (refresh) |
| Density | High |
| Cost | Low |
| Use | Main memory (RAM) |

DRAM Type Evolution

| Type | Bandwidth | Voltage | Features | Application |
| --- | --- | --- | --- | --- |
| DDR4 | 25.6 GB/s | 1.2 V | Mature and stable | PCs, servers |
| DDR5 | 51.2 GB/s | 1.1 V | Higher bandwidth | Next-gen PCs |
| LPDDR4X | 34.1 GB/s | 0.6 V | Low power | Jetson, smartphones |
| LPDDR5 | 51.2 GB/s | 0.5 V | Low power + high bandwidth | Jetson Orin |
| LPDDR5X | 67.2 GB/s | 0.5 V | Latest generation | High-end mobile SoCs |
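
These peak-bandwidth figures are simply the transfer rate multiplied by the 64-bit channel width. For DDR5-6400 (and likewise LPDDR5-6400):

\[ 6400 \times 10^{6}\ \text{transfers/s} \times 8\ \text{bytes} = 51.2\ \text{GB/s} \]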

Memory Selection for Robot Platforms

  • Jetson Orin: LPDDR5, unified memory (shared by CPU and GPU)
  • Raspberry Pi 5: LPDDR4X, 4/8GB
  • STM32H7: Built-in 1MB SRAM, no external DRAM needed
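
To verify how much RAM a given board actually exposes, the standard procfs counters can be read on any Linux platform (Jetson, Raspberry Pi). A minimal C++ sketch:

// meminfo.cpp: print total and available RAM from /proc/meminfo
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream meminfo("/proc/meminfo");
    std::string line;
    while (std::getline(meminfo, line)) {
        // MemTotal and MemAvailable are the fields that matter for budgeting
        if (line.rfind("MemTotal", 0) == 0 || line.rfind("MemAvailable", 0) == 0)
            std::cout << line << '\n';
    }
}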

Cache Coherence

Multi-Core Caching Problem

When multiple CPU cores each have their own L1 cache, the same data may exist in multiple copies:

Core 0               Core 1
+--------+           +--------+
|L1: X=5 |           |L1: X=5 |
+----+---+           +----+---+
     |    +----------+    |
     +----| L2: X=5  |----+
          +----+-----+
               |
          +----+-----+
          | DRAM: X=5|
          +----------+

If Core 0 now writes X=10, the copy of X in Core 1's cache still reads 5: the caches are inconsistent!

MESI Protocol

MESI is the most commonly used cache coherence protocol. Each cache line is in one of four states:

| State | Description |
| --- | --- |
| Modified (M) | Valid only in this core's cache; inconsistent with main memory |
| Exclusive (E) | Valid only in this core's cache; consistent with main memory |
| Shared (S) | Valid in multiple cores' caches; consistent with main memory |
| Invalid (I) | Cache line is invalid |

graph TD
    I[Invalid] -->|"Local read (Bus Read)"| S[Shared]
    I -->|"Local read (no other copy)"| E[Exclusive]
    E -->|"Local write"| M[Modified]
    S -->|"Local write (Invalidate)"| M
    M -->|"Remote read (Write Back)"| S
    E -->|"Remote read"| S
    S -->|"Remote write (Invalidate)"| I
    M -->|"Remote write (Write Back)"| I

Impact on Robot Software

In ROS2 multi-threaded nodes, when multiple threads share data, note that:

  • Frequent writes to shared variables cause cache lines to bounce between cores
  • Use std::atomic or locks to ensure correctness
  • False sharing: different variables that happen to sit in the same 64-byte cache line, so a write by one thread invalidates the other thread's cached copy (see the sketch below)
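
A minimal C++ sketch of the padding fix, assuming 64-byte cache lines (true of Jetson Orin's Cortex-A78AE cores and most x86 parts):

#include <atomic>
#include <cstdint>

// BAD: a and b share one 64-byte cache line, so two threads updating
// them independently still invalidate each other's L1 copies.
struct Counters {
    std::atomic<uint64_t> a{0};
    std::atomic<uint64_t> b{0};
};

// BETTER: force each counter onto its own cache line.
struct alignas(64) PaddedCounter {
    std::atomic<uint64_t> value{0};
};

PaddedCounter counter_a, counter_b;  // now on separate cache lines

Where the standard library provides it, std::hardware_destructive_interference_size (from <new>, C++17) can replace the hard-coded 64.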

Memory Hierarchy

Complete Storage Hierarchy

\[ \text{Registers} \xrightarrow{<1\text{ns}} \text{L1} \xrightarrow{2\text{ns}} \text{L2} \xrightarrow{10\text{ns}} \text{L3} \xrightarrow{50\text{ns}} \text{DRAM} \xrightarrow{100\mu\text{s}} \text{SSD} \xrightarrow{5\text{ms}} \text{HDD} \]

Average Access Time

The Average Memory Access Time (AMAT) for multi-level caches:

\[ \text{AMAT} = T_{L1} + m_{L1} \times (T_{L2} + m_{L2} \times (T_{L3} + m_{L3} \times T_{\text{DRAM}})) \]

Where \(m_i\) is the miss rate of the \(i\)-th cache level.

Calculation Example

Assume a two-level cache (no L3), with \(T_{L1}=1\text{ns}\), \(m_{L1}=5\%\), \(T_{L2}=5\text{ns}\), \(m_{L2}=20\%\), \(T_{\text{DRAM}}=50\text{ns}\):

\[\text{AMAT} = 1 + 0.05 \times (5 + 0.20 \times 50) = 1 + 0.05 \times 15 = 1.75\text{ns}\]
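
The same computation as a tiny C++ helper (the function name is illustrative), convenient for quick what-if analysis of cache parameters:

// amat.cpp: evaluate the two-level AMAT formula above
#include <cstdio>

double amat(double t_l1, double m_l1, double t_l2, double m_l2, double t_dram) {
    return t_l1 + m_l1 * (t_l2 + m_l2 * t_dram);
}

int main() {
    std::printf("AMAT = %.2f ns\n", amat(1.0, 0.05, 5.0, 0.20, 50.0));  // prints 1.75 ns
}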

Principle of Locality

The theoretical foundation for cache effectiveness:

  • Temporal Locality: Recently accessed data is likely to be accessed again soon
  • Spatial Locality: After accessing a certain address, nearby addresses are also likely to be accessed
// Good spatial locality (row-major traversal)
for (int i = 0; i < rows; i++)
    for (int j = 0; j < cols; j++)
        process(image[i][j]);  // Contiguous memory access

// Poor spatial locality (column-major traversal)
for (int j = 0; j < cols; j++)
    for (int i = 0; i < rows; i++)
        process(image[i][j]);  // Strided memory access
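
The effect is easy to measure. A rough benchmark sketch (the 4096x4096 size is arbitrary; actual numbers depend on cache sizes and compiler flags):

// locality_bench.cpp: time row-major vs column-major traversal
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const int rows = 4096, cols = 4096;
    std::vector<float> image((size_t)rows * cols, 1.0f);
    float sink = 0.0f;  // printed at the end so the loops are not optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            sink += image[(size_t)i * cols + j];  // contiguous access
    auto t1 = std::chrono::steady_clock::now();
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < rows; i++)
            sink += image[(size_t)i * cols + j];  // 16 KB stride per step
    auto t2 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> row_ms = t1 - t0, col_ms = t2 - t1;
    std::printf("row-major %.1f ms, column-major %.1f ms (sink=%f)\n",
                row_ms.count(), col_ms.count(), sink);
}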

DMA (Direct Memory Access)

DMA Principles

A DMA controller allows peripherals to exchange data directly with memory without CPU involvement:

graph LR
    A[Sensor/Peripheral] -->|DMA Transfer| B[Memory DRAM]
    C[CPU] -.->|Configure DMA| D[DMA Controller]
    D -->|Manage Transfer| A
    D -->|Manage Transfer| B
    D -->|Transfer Complete Interrupt| C

DMA Applications in Robotics

| Scenario | Data Source | Destination | Data Volume | Description |
| --- | --- | --- | --- | --- |
| Camera capture | CSI interface | Frame buffer | ~6 MB/frame (1080p) | ISP+DMA pipeline |
| LiDAR data | SPI/Ethernet | Point cloud buffer | ~100 KB/frame | High-frequency transfer |
| ADC sampling | ADC peripheral | Ring buffer | ~several KB/batch | STM32 DMA+ADC |
| Audio | I2S interface | Audio buffer | ~192 KB/s | Double buffering |

STM32 DMA Configuration Example

// STM32 HAL: continuous ADC acquisition via DMA (circular mode)
HAL_ADC_Start_DMA(&hadc1, (uint32_t*)adc_buffer, BUFFER_SIZE);

// Half-transfer callback: the first half of the buffer is ready
// while DMA keeps filling the second half (double buffering)
void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef* hadc) {
    process_sensor_data(&adc_buffer[0], BUFFER_SIZE / 2);
}

// Transfer-complete callback: the second half is ready; DMA wraps around
void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef* hadc) {
    process_sensor_data(&adc_buffer[BUFFER_SIZE / 2], BUFFER_SIZE / 2);
}

Storage Media Comparison

eMMC vs. NVMe

| Feature | eMMC 5.1 | NVMe SSD |
| --- | --- | --- |
| Interface | Parallel MMC bus | PCIe Gen3/4 |
| Sequential Read | ~400 MB/s | ~3500 MB/s |
| Sequential Write | ~200 MB/s | ~3000 MB/s |
| Random Read (IOPS) | ~15K | ~500K |
| Power | ~0.5 W | ~3-5 W |
| Price (64 GB) | ~$10 | ~$30 |
| Use | Embedded systems | High-performance needs |
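
For a quick sanity check of sequential write speed on a given medium, a crude C++ sketch (the file path is arbitrary; the result includes the page cache, so treat it as an upper bound and use a dedicated tool such as fio for real measurements):

// write_bench.cpp: rough sequential-write throughput estimate
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <vector>

int main() {
    const size_t chunk = 4 * 1024 * 1024;    // 4 MiB per write
    const size_t total = 512 * 1024 * 1024;  // 512 MiB in total
    std::vector<char> buf(chunk, 0x5A);

    auto t0 = std::chrono::steady_clock::now();
    std::ofstream out("/tmp/bench.bin", std::ios::binary);
    for (size_t written = 0; written < total; written += chunk)
        out.write(buf.data(), buf.size());
    out.flush();
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("~%.0f MB/s (page cache included)\n", total / 1e6 / secs);
}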

Robot Storage Requirements

| Data Type | Data Rate | Storage Needed (1 hour) | Recommended Storage |
| --- | --- | --- | --- |
| Log files | ~1 MB/s | ~3.6 GB | eMMC |
| 720p video | ~5 MB/s | ~18 GB | NVMe |
| 1080p video | ~15 MB/s | ~54 GB | NVMe |
| Point cloud data | ~10 MB/s | ~36 GB | NVMe |
| rosbag recording | ~50 MB/s | ~180 GB | NVMe |
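
Each hourly figure is simply the data rate multiplied by 3600 seconds; for rosbag recording:

\[ 50\ \text{MB/s} \times 3600\ \text{s} = 180000\ \text{MB} = 180\ \text{GB} \]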

Memory-Mapped I/O (MMIO)

Principles

In ARM architecture, peripheral registers are mapped to the memory address space. The CPU controls hardware by reading from and writing to specific addresses:

Memory Address Space:
0x0000_0000 - 0x1FFF_FFFF : Flash (program code)
0x2000_0000 - 0x3FFF_FFFF : SRAM (data)
0x4000_0000 - 0x5FFF_FFFF : Peripheral registers (MMIO)
    0x4000_0000 : TIM2 (timer)
    0x4000_1000 : TIM3
    0x4000_4400 : USART2
    0x4001_0000 : GPIOA
    ...
0xE000_0000 - 0xFFFF_FFFF : System peripherals (NVIC, SysTick)

Accessing Hardware in Linux

// Accessing physical addresses via mmap in Linux (GPIO_BASE_ADDR and
// GPIO_SET_OFFSET are platform-specific; see the SoC reference manual)
#include <sys/mman.h>
#include <fcntl.h>
#include <stdint.h>

int fd = open("/dev/mem", O_RDWR | O_SYNC);  // requires root
volatile uint32_t* gpio = (volatile uint32_t*)mmap(
    NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
    fd, GPIO_BASE_ADDR
);
if (gpio == MAP_FAILED) { /* handle the error */ }

// Set GPIO output: write 1 to the pin's bit in the set register
gpio[GPIO_SET_OFFSET / 4] = (1u << pin);

Security Considerations

Directly accessing /dev/mem requires root privileges and poses security risks. In production systems, hardware should be accessed through device drivers or the sysfs interface.
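
For example, GPIO pins can be driven through sysfs without touching /dev/mem. A minimal sketch using the legacy /sys/class/gpio interface (pin 18 is a made-up example; on recent kernels the libgpiod character-device API is the preferred replacement):

// gpio_sysfs.cpp: set a GPIO pin high via the legacy sysfs interface
#include <fstream>
#include <string>

void gpio_write(int pin, bool value) {
    const std::string base = "/sys/class/gpio/";
    std::ofstream(base + "export") << pin;  // fails silently if already exported
    const std::string dir = base + "gpio" + std::to_string(pin) + "/";
    std::ofstream(dir + "direction") << "out";
    std::ofstream(dir + "value") << (value ? 1 : 0);
}

int main() { gpio_write(18, true); }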

Jetson Unified Memory Architecture

The Jetson platform employs a Unified Memory Architecture:

+-----------------------------+
|        LPDDR5 Memory        |
|   (Shared by CPU and GPU,   |
|       no copy needed)       |
+--------------+--------------+
|   CPU View   |   GPU View   |
| (cache       | (cache       |
|  coherent)   |  coherent)   |
+--------------+--------------+

Advantages:

  • Zero-Copy: CPU and GPU can directly access the same memory
  • Simplified programming: No explicit cudaMemcpy needed
  • Reduced latency: Avoids data transfer overhead
# PyTorch on Jetson: with unified memory, a CUDA tensor's backing store
# lives in the same LPDDR5 the CPU uses, so transfers avoid a PCIe hop
import torch
from torch.utils.data import DataLoader

# Allocate directly on the GPU; on Jetson this is the shared DRAM
tensor = torch.zeros(1000, 1000, device='cuda')

# pin_memory speeds up host-to-device transfers in the dataloader
dataloader = DataLoader(dataset, pin_memory=True)
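
The same zero-copy idea is available to C++ code through CUDA managed memory. A minimal sketch, assuming the CUDA toolkit is installed (compile with nvcc; error checks omitted for brevity):

// unified.cpp: one allocation visible to both CPU and GPU
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    float* data = nullptr;
    cudaMallocManaged(&data, 1024 * sizeof(float));  // on Jetson: shared LPDDR5

    for (int i = 0; i < 1024; i++) data[i] = 1.0f;   // CPU writes directly

    // ... launch kernels that use `data` here; no cudaMemcpy is needed ...

    cudaDeviceSynchronize();  // make GPU work visible before the CPU reads
    std::printf("data[0] = %.1f\n", data[0]);
    cudaFree(data);
}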

Memory Optimization Strategies

Robot System Memory Budget

| Component | Memory Usage | Description |
| --- | --- | --- |
| Linux kernel + system services | ~500 MB | Base overhead |
| ROS2 runtime | ~200 MB | DDS + node management |
| Deep learning models | 500 MB-4 GB | Depends on model size |
| Camera frame buffers | ~100 MB | Multi-frame buffering |
| Point cloud buffers | ~200 MB | LiDAR data |
| Navigation maps | 100 MB-1 GB | Depends on map size |
| Total | ~2-6 GB | 8 GB is usually sufficient |

Memory Leak Detection

# Monitor process memory usage
watch -n 1 'ps aux --sort=-rss | head -10'

# Use valgrind for C++ memory leak detection
valgrind --leak-check=full ./robot_node

# Python memory profiling
python -m memory_profiler robot_script.py

Summary

  1. SRAM is fast but expensive, making it suitable for caches; DRAM is slower but dense and cheap, making it suitable for main memory
  2. Cache coherence protocols (MESI) ensure multi-core data consistency
  3. DMA frees the CPU by letting sensor data write directly to memory
  4. NVMe SSDs are essential for data-intensive robots (e.g., autonomous driving)
  5. Jetson unified memory simplifies CPU-GPU data sharing
  6. Proper memory budgeting ensures stable system operation

References

  • Patterson, D. A., & Hennessy, J. L. Computer Organization and Design
  • STM32 Reference Manual (DMA chapter)
  • NVIDIA Jetson Memory Architecture Guide
  • Linux Kernel Documentation: Memory Management
