Memory and Storage

Overview

Memory and storage systems form the data infrastructure of robot computing platforms. From high-speed caches to persistent storage, different tiers of storage media achieve a balance among capacity, speed, and cost. Understanding the memory hierarchy is essential for optimizing data throughput in robot systems.

SRAM vs. DRAM

SRAM (Static Random-Access Memory)

SRAM uses six transistors (6T) to construct a single storage cell:

| Feature | Value |
| --- | --- |
| Transistors/bit | 6 |
| Access Speed | 1-2 ns |
| Requires Refresh | No |
| Power Consumption | Low (static) |
| Density | Low |
| Cost | High |
| Use | CPU caches (L1/L2/L3) |

DRAM (Dynamic Random-Access Memory)

DRAM uses one transistor + one capacitor (1T1C):

| Feature | Value |
| --- | --- |
| Transistors/bit | 1 + 1 capacitor |
| Access Speed | 50-100 ns |
| Requires Refresh | Yes (~every 64 ms) |
| Power Consumption | Higher (refresh) |
| Density | High |
| Cost | Low |
| Use | Main memory (RAM) |

DRAM Type Evolution

| Type | Bandwidth | Voltage | Features | Application |
| --- | --- | --- | --- | --- |
| DDR4 | 25.6 GB/s | 1.2 V | Mature and stable | PCs, servers |
| DDR5 | 51.2 GB/s | 1.1 V | Higher bandwidth | Next-gen PCs |
| LPDDR4X | 34.1 GB/s | 0.6 V | Low power | Jetson, smartphones |
| LPDDR5 | 51.2 GB/s | 0.5 V | Low power + high bandwidth | Jetson Orin |
| LPDDR5X | 67.2 GB/s | 0.5 V | Latest generation | High-end mobile SoCs |
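
These peak-bandwidth figures are simply the transfer rate multiplied by the 64-bit channel width. For DDR5-6400 (and likewise LPDDR5-6400):

\[ 6400 \times 10^{6}\ \text{transfers/s} \times 8\ \text{bytes} = 51.2\ \text{GB/s} \]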

Memory Selection for Robot Platforms

  • Jetson Orin: LPDDR5, unified memory (shared by CPU and GPU)
  • Raspberry Pi 5: LPDDR4X, 4/8GB
  • STM32H7: Built-in 1MB SRAM, no external DRAM needed
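
To verify how much RAM a given board actually exposes, the standard procfs counters can be read on any Linux platform (Jetson, Raspberry Pi). A minimal C++ sketch:

// meminfo.cpp: print total and available RAM from /proc/meminfo
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream meminfo("/proc/meminfo");
    std::string line;
    while (std::getline(meminfo, line)) {
        // MemTotal and MemAvailable are the fields that matter for budgeting
        if (line.rfind("MemTotal", 0) == 0 || line.rfind("MemAvailable", 0) == 0)
            std::cout << line << '\n';
    }
}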

Cache Coherence

Multi-Core Caching Problem

When multiple CPU cores each have their own L1 cache, the same data may exist in multiple copies:

Core 0               Core 1
+--------+           +--------+
|L1: X=5 |           |L1: X=5 |
+----+---+           +----+---+
     |    +----------+    |
     +----| L2: X=5  |----+
          +----+-----+
               |
          +----+-----+
          | DRAM: X=5|
          +----------+

If Core 0 now writes X=10, the copy of X in Core 1's cache still reads 5: the caches are inconsistent!

MESI Protocol

MESI is the most commonly used cache coherence protocol. Each cache line is in one of four states:

| State | Description |
| --- | --- |
| Modified (M) | Valid only in this core's cache; inconsistent with main memory |
| Exclusive (E) | Valid only in this core's cache; consistent with main memory |
| Shared (S) | Valid in multiple cores' caches; consistent with main memory |
| Invalid (I) | Cache line is invalid |

graph TD
    I[Invalid] -->|"Local read (Bus Read)"| S[Shared]
    I -->|"Local read (no other copy)"| E[Exclusive]
    E -->|"Local write"| M[Modified]
    S -->|"Local write (Invalidate)"| M
    M -->|"Remote read (Write Back)"| S
    E -->|"Remote read"| S
    S -->|"Remote write (Invalidate)"| I
    M -->|"Remote write (Write Back)"| I

Impact on Robot Software

In ROS2 multi-threaded nodes, when multiple threads share data, note that:

  • Frequent writes to shared variables cause cache lines to bounce between cores
  • Use std::atomic or locks to ensure correctness
  • False sharing: different variables that happen to sit in the same 64-byte cache line, so a write by one thread invalidates the other thread's cached copy (see the sketch below)
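
A minimal C++ sketch of the padding fix, assuming 64-byte cache lines (true of Jetson Orin's Cortex-A78AE cores and most x86 parts):

#include <atomic>
#include <cstdint>

// BAD: a and b share one 64-byte cache line, so two threads updating
// them independently still invalidate each other's L1 copies.
struct Counters {
    std::atomic<uint64_t> a{0};
    std::atomic<uint64_t> b{0};
};

// BETTER: force each counter onto its own cache line.
struct alignas(64) PaddedCounter {
    std::atomic<uint64_t> value{0};
};

PaddedCounter counter_a, counter_b;  // now on separate cache lines

Where the standard library provides it, std::hardware_destructive_interference_size (from <new>, C++17) can replace the hard-coded 64.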

Memory Hierarchy

Complete Storage Hierarchy

\[ \text{Registers} \xrightarrow{<1\text{ns}} \text{L1} \xrightarrow{2\text{ns}} \text{L2} \xrightarrow{10\text{ns}} \text{L3} \xrightarrow{50\text{ns}} \text{DRAM} \xrightarrow{100\mu\text{s}} \text{SSD} \xrightarrow{5\text{ms}} \text{HDD} \]

Average Access Time

The Average Memory Access Time (AMAT) for multi-level caches:

\[ \text{AMAT} = T_{L1} + m_{L1} \times (T_{L2} + m_{L2} \times (T_{L3} + m_{L3} \times T_{\text{DRAM}})) \]

Where \(m_i\) is the miss rate of the \(i\)-th cache level.

Calculation Example

Assume a two-level cache (no L3), with \(T_{L1}=1\text{ns}\), \(m_{L1}=5\%\), \(T_{L2}=5\text{ns}\), \(m_{L2}=20\%\), \(T_{\text{DRAM}}=50\text{ns}\):

\[\text{AMAT} = 1 + 0.05 \times (5 + 0.20 \times 50) = 1 + 0.05 \times 15 = 1.75\text{ns}\]
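
The same computation as a tiny C++ helper (the function name is illustrative), convenient for quick what-if analysis of cache parameters:

// amat.cpp: evaluate the two-level AMAT formula above
#include <cstdio>

double amat(double t_l1, double m_l1, double t_l2, double m_l2, double t_dram) {
    return t_l1 + m_l1 * (t_l2 + m_l2 * t_dram);
}

int main() {
    std::printf("AMAT = %.2f ns\n", amat(1.0, 0.05, 5.0, 0.20, 50.0));  // prints 1.75 ns
}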

Principle of Locality

The theoretical foundation for cache effectiveness:

  • Temporal Locality: Recently accessed data is likely to be accessed again soon
  • Spatial Locality: After accessing a certain address, nearby addresses are also likely to be accessed
// Good spatial locality (row-major traversal)
for (int i = 0; i < rows; i++)
    for (int j = 0; j < cols; j++)
        process(image[i][j]);  // Contiguous memory access

// Poor spatial locality (column-major traversal)
for (int j = 0; j < cols; j++)
    for (int i = 0; i < rows; i++)
        process(image[i][j]);  // Strided memory access
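
The effect is easy to measure. A rough benchmark sketch (the 4096x4096 size is arbitrary; actual numbers depend on cache sizes and compiler flags):

// locality_bench.cpp: time row-major vs column-major traversal
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const int rows = 4096, cols = 4096;
    std::vector<float> image((size_t)rows * cols, 1.0f);
    float sink = 0.0f;  // printed at the end so the loops are not optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            sink += image[(size_t)i * cols + j];  // contiguous access
    auto t1 = std::chrono::steady_clock::now();
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < rows; i++)
            sink += image[(size_t)i * cols + j];  // 16 KB stride per step
    auto t2 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> row_ms = t1 - t0, col_ms = t2 - t1;
    std::printf("row-major %.1f ms, column-major %.1f ms (sink=%f)\n",
                row_ms.count(), col_ms.count(), sink);
}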

DMA (Direct Memory Access)

DMA Principles

A DMA controller allows peripherals to exchange data directly with memory without CPU involvement:

graph LR
    A[Sensor/Peripheral] -->|DMA Transfer| B[Memory DRAM]
    C[CPU] -.->|Configure DMA| D[DMA Controller]
    D -->|Manage Transfer| A
    D -->|Manage Transfer| B
    D -->|Transfer Complete Interrupt| C

DMA Applications in Robotics

| Scenario | Data Source | Destination | Data Volume | Description |
| --- | --- | --- | --- | --- |
| Camera capture | CSI interface | Frame buffer | ~6 MB/frame (1080p) | ISP+DMA pipeline |
| LiDAR data | SPI/Ethernet | Point cloud buffer | ~100 KB/frame | High-frequency transfer |
| ADC sampling | ADC peripheral | Ring buffer | ~several KB/batch | STM32 DMA+ADC |
| Audio | I2S interface | Audio buffer | ~192 KB/s | Double buffering |

STM32 DMA Configuration Example

// STM32 HAL: continuous ADC acquisition via DMA (circular mode)
HAL_ADC_Start_DMA(&hadc1, (uint32_t*)adc_buffer, BUFFER_SIZE);

// Half-transfer callback: the first half of the buffer is ready
// while DMA keeps filling the second half (double buffering)
void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef* hadc) {
    process_sensor_data(&adc_buffer[0], BUFFER_SIZE / 2);
}

// Transfer-complete callback: the second half is ready; DMA wraps around
void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef* hadc) {
    process_sensor_data(&adc_buffer[BUFFER_SIZE / 2], BUFFER_SIZE / 2);
}

Storage Media Comparison

eMMC vs. NVMe

| Feature | eMMC 5.1 | NVMe SSD |
| --- | --- | --- |
| Interface | Parallel MMC bus | PCIe Gen3/4 |
| Sequential Read | ~400 MB/s | ~3500 MB/s |
| Sequential Write | ~200 MB/s | ~3000 MB/s |
| Random Read (IOPS) | ~15K | ~500K |
| Power | ~0.5 W | ~3-5 W |
| Price (64 GB) | ~$10 | ~$30 |
| Use | Embedded systems | High-performance needs |
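
For a quick sanity check of sequential write speed on a given medium, a crude C++ sketch (the file path is arbitrary; the result includes the page cache, so treat it as an upper bound and use a dedicated tool such as fio for real measurements):

// write_bench.cpp: rough sequential-write throughput estimate
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <vector>

int main() {
    const size_t chunk = 4 * 1024 * 1024;    // 4 MiB per write
    const size_t total = 512 * 1024 * 1024;  // 512 MiB in total
    std::vector<char> buf(chunk, 0x5A);

    auto t0 = std::chrono::steady_clock::now();
    std::ofstream out("/tmp/bench.bin", std::ios::binary);
    for (size_t written = 0; written < total; written += chunk)
        out.write(buf.data(), buf.size());
    out.flush();
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("~%.0f MB/s (page cache included)\n", total / 1e6 / secs);
}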

Robot Storage Requirements

| Data Type | Data Rate | Storage Needed (1 hour) | Recommended Storage |
| --- | --- | --- | --- |
| Log files | ~1 MB/s | ~3.6 GB | eMMC |
| 720p video | ~5 MB/s | ~18 GB | NVMe |
| 1080p video | ~15 MB/s | ~54 GB | NVMe |
| Point cloud data | ~10 MB/s | ~36 GB | NVMe |
| rosbag recording | ~50 MB/s | ~180 GB | NVMe |
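
Each hourly figure is simply the data rate multiplied by 3600 seconds; for rosbag recording:

\[ 50\ \text{MB/s} \times 3600\ \text{s} = 180000\ \text{MB} = 180\ \text{GB} \]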

Memory-Mapped I/O (MMIO)

Principles

In ARM architecture, peripheral registers are mapped to the memory address space. The CPU controls hardware by reading from and writing to specific addresses:

Memory Address Space:
0x0000_0000 - 0x1FFF_FFFF : Flash (program code)
0x2000_0000 - 0x3FFF_FFFF : SRAM (data)
0x4000_0000 - 0x5FFF_FFFF : Peripheral registers (MMIO)
    0x4000_0000 : TIM2 (timer)
    0x4000_1000 : TIM3
    0x4000_4400 : USART2
    0x4001_0000 : GPIOA
    ...
0xE000_0000 - 0xFFFF_FFFF : System peripherals (NVIC, SysTick)

Accessing Hardware in Linux

// Accessing physical addresses via mmap in Linux (GPIO_BASE_ADDR and
// GPIO_SET_OFFSET are platform-specific; see the SoC reference manual)
#include <sys/mman.h>
#include <fcntl.h>
#include <stdint.h>

int fd = open("/dev/mem", O_RDWR | O_SYNC);  // requires root
volatile uint32_t* gpio = (volatile uint32_t*)mmap(
    NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
    fd, GPIO_BASE_ADDR
);
if (gpio == MAP_FAILED) { /* handle the error */ }

// Set GPIO output: write 1 to the pin's bit in the set register
gpio[GPIO_SET_OFFSET / 4] = (1u << pin);

Security Considerations

Directly accessing /dev/mem requires root privileges and poses security risks. In production systems, hardware should be accessed through device drivers or the sysfs interface.
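
For example, GPIO pins can be driven through sysfs without touching /dev/mem. A minimal sketch using the legacy /sys/class/gpio interface (pin 18 is a made-up example; on recent kernels the libgpiod character-device API is the preferred replacement):

// gpio_sysfs.cpp: set a GPIO pin high via the legacy sysfs interface
#include <fstream>
#include <string>

void gpio_write(int pin, bool value) {
    const std::string base = "/sys/class/gpio/";
    std::ofstream(base + "export") << pin;  // fails silently if already exported
    const std::string dir = base + "gpio" + std::to_string(pin) + "/";
    std::ofstream(dir + "direction") << "out";
    std::ofstream(dir + "value") << (value ? 1 : 0);
}

int main() { gpio_write(18, true); }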

Jetson Unified Memory Architecture

The Jetson platform employs a Unified Memory Architecture:

+-----------------------------+
|        LPDDR5 Memory        |
|   (Shared by CPU and GPU,   |
|       no copy needed)       |
+--------------+--------------+
|   CPU View   |   GPU View   |
| (cache       | (cache       |
|  coherent)   |  coherent)   |
+--------------+--------------+

Advantages:

  • Zero-Copy: CPU and GPU can directly access the same memory
  • Simplified programming: No explicit cudaMemcpy needed
  • Reduced latency: Avoids data transfer overhead
# PyTorch on Jetson: with unified memory, a CUDA tensor's backing store
# lives in the same LPDDR5 the CPU uses, so transfers avoid a PCIe hop
import torch
from torch.utils.data import DataLoader

# Allocate directly on the GPU; on Jetson this is the shared DRAM
tensor = torch.zeros(1000, 1000, device='cuda')

# pin_memory speeds up host-to-device transfers in the dataloader
dataloader = DataLoader(dataset, pin_memory=True)
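
The same zero-copy idea is available to C++ code through CUDA managed memory. A minimal sketch, assuming the CUDA toolkit is installed (compile with nvcc; error checks omitted for brevity):

// unified.cpp: one allocation visible to both CPU and GPU
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    float* data = nullptr;
    cudaMallocManaged(&data, 1024 * sizeof(float));  // on Jetson: shared LPDDR5

    for (int i = 0; i < 1024; i++) data[i] = 1.0f;   // CPU writes directly

    // ... launch kernels that use `data` here; no cudaMemcpy is needed ...

    cudaDeviceSynchronize();  // make GPU work visible before the CPU reads
    std::printf("data[0] = %.1f\n", data[0]);
    cudaFree(data);
}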

Memory Optimization Strategies

Robot System Memory Budget

| Component | Memory Usage | Description |
| --- | --- | --- |
| Linux kernel + system services | ~500 MB | Base overhead |
| ROS2 runtime | ~200 MB | DDS + node management |
| Deep learning models | 500 MB-4 GB | Depends on model size |
| Camera frame buffers | ~100 MB | Multi-frame buffering |
| Point cloud buffers | ~200 MB | LiDAR data |
| Navigation maps | 100 MB-1 GB | Depends on map size |
| Total | ~2-6 GB | 8 GB is usually sufficient |

Memory Leak Detection

# Monitor process memory usage
watch -n 1 'ps aux --sort=-rss | head -10'

# Use valgrind for C++ memory leak detection
valgrind --leak-check=full ./robot_node

# Python memory profiling
python -m memory_profiler robot_script.py

Summary

  1. SRAM is fast but expensive, making it suitable for caches; DRAM is slower but dense and cheap, making it suitable for main memory
  2. Cache coherence protocols (MESI) ensure multi-core data consistency
  3. DMA frees the CPU by letting sensor data write directly to memory
  4. NVMe SSDs are essential for data-intensive robots (e.g., autonomous driving)
  5. Jetson unified memory simplifies CPU-GPU data sharing
  6. Proper memory budgeting ensures stable system operation

References

  • Patterson, D. A., & Hennessy, J. L. Computer Organization and Design
  • STM32 Reference Manual (DMA chapter)
  • NVIDIA Jetson Memory Architecture Guide
  • Linux Kernel Documentation: Memory Management
