Memory Hierarchy Design
1. Principle of Locality
The theoretical foundation of memory hierarchy design is the Principle of Locality:
| Type | Meaning | Exploitation |
|---|---|---|
| Temporal locality | Recently accessed data is likely to be accessed again | Keep recent data in fast storage (caching) |
| Spatial locality | Data near the current address is likely to be accessed | Transfer data in blocks (cache lines) |
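As a concrete illustration, the C sketch below (array layout and sizes are illustrative) sums a row-major matrix two ways: the natural loop order reuses `sum` on every iteration (temporal locality) and walks memory with stride 1 (spatial locality), while the swapped loop order strides by a whole row and wastes most of each fetched cache line.

```c
#include <stddef.h>

/* Sums a matrix stored in row-major order.
 * Temporal locality: `sum` is reused on every iteration.
 * Spatial locality: a[i][j] and a[i][j+1] are adjacent in memory,
 * so one fetched cache line serves several consecutive iterations. */
double sum_row_major(size_t rows, size_t cols, const double a[rows][cols]) {
    double sum = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)   /* stride-1 accesses */
            sum += a[i][j];
    return sum;
}

/* Swapping the loops (column-major traversal of a row-major array)
 * performs the same arithmetic but strides by `cols` doubles per access,
 * wasting most of each cache line and typically running much slower. */
double sum_col_major(size_t rows, size_t cols, const double a[rows][cols]) {
    double sum = 0.0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)   /* stride = cols doubles */
            sum += a[i][j];
    return sum;
}
```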
2. Memory Hierarchy Overview
```mermaid
graph TB
A["Registers<br/>~1 ns | ~1 KB | SRAM"] --> B["L1 Cache<br/>~1-2 ns | 32-128 KB | SRAM"]
B --> C["L2 Cache<br/>~3-10 ns | 256 KB-1 MB | SRAM"]
C --> D["L3 Cache<br/>~10-30 ns | 4-64 MB | SRAM"]
D --> E["Main Memory DRAM<br/>~50-100 ns | 8-128 GB | DRAM"]
E --> F["SSD<br/>~10-100 us | 256 GB-8 TB | Flash"]
F --> G["HDD<br/>~5-10 ms | 1-20 TB | Magnetic Disk"]
style A fill:#e8f5e9
style B fill:#c8e6c9
style C fill:#a5d6a7
style D fill:#81c784
style E fill:#fff9c4
style F fill:#ffe0b2
style G fill:#ffccbc
```
Core design goal: run programs at speeds close to those of the fastest level, while offering the capacity of the largest level at a cost per byte close to that of the cheapest (slowest) level.
3. Average Memory Access Time (AMAT)
The basic formula is

\[
\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
\]

For multi-level caches, the miss penalty of each level is the AMAT of the level below it:

\[
\text{AMAT} = t_{L1} + m_{L1} \times \bigl( t_{L2} + m_{L2} \times \text{Miss penalty}_{L2} \bigr)
\]
Numerical Example
With L1 hit time = 1 ns, L1 miss rate = 5%, and miss penalty (main memory access) = 100 ns:

\[
\text{AMAT} = 1 + 0.05 \times 100 = 6 \text{ ns}
\]

If the L1 miss rate drops from 5% to 2%:

\[
\text{AMAT} = 1 + 0.02 \times 100 = 3 \text{ ns}
\]

AMAT is halved, demonstrating the enormous value of reducing the miss rate.
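A minimal helper, as a sketch, extending the same formula to two cache levels (all numbers in the example are illustrative):

```c
#include <stdio.h>

/* AMAT for a two-level cache: the L1 miss penalty is itself the
 * AMAT of the L2 level (hit time + miss rate * memory latency). */
static double amat_two_level(double t_l1, double m_l1,
                             double t_l2, double m_l2,
                             double t_mem) {
    return t_l1 + m_l1 * (t_l2 + m_l2 * t_mem);
}

int main(void) {
    /* Illustrative numbers: 1 ns L1, 5% L1 misses, 10 ns L2,
     * 20% L2 misses, 100 ns main memory. */
    printf("AMAT = %.2f ns\n", amat_two_level(1.0, 0.05, 10.0, 0.20, 100.0));
    return 0;
}
```

With these numbers the L1 miss penalty is the L2 AMAT (10 + 0.2 × 100 = 30 ns), giving an overall AMAT of 1 + 0.05 × 30 = 2.5 ns.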
4. Cache Fundamentals
4.1 Cache Line Structure
Each cache line contains:
[Valid Bit(1)] [Dirty Bit(1)] [Tag(t bits)] [Data Block(B bytes)]
Address decomposition (\(m\)-bit address, \(2^s\) sets, \(E\) ways per set, block size \(B = 2^b\) bytes):

\[
\underbrace{\text{Tag}}_{m - s - b \text{ bits}} \;\big|\; \underbrace{\text{Set index}}_{s \text{ bits}} \;\big|\; \underbrace{\text{Block offset}}_{b \text{ bits}}
\]
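In a cache simulator these fields are typically extracted with shifts and masks; a small sketch (type and function names are illustrative):

```c
#include <stdint.h>

/* Decompose an address into tag | set index | block offset,
 * given 2^s sets and 2^b-byte blocks. */
typedef struct {
    uint64_t tag;     /* upper m - s - b bits */
    uint64_t set;     /* middle s bits        */
    uint64_t offset;  /* lower  b bits        */
} addr_fields;

addr_fields split_address(uint64_t addr, unsigned s, unsigned b) {
    addr_fields f;
    f.offset = addr & ((1ULL << b) - 1);
    f.set    = (addr >> b) & ((1ULL << s) - 1);
    f.tag    = addr >> (s + b);
    return f;
}
```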
4.2 Mapping Schemes
```mermaid
graph LR
subgraph "Direct-Mapped"
A1[Address] -->|Unique mapping| B1[1 location]
end
subgraph "Set-Associative"
A2[Address] -->|Maps to set| B2[Choose 1 of N locations]
end
subgraph "Fully-Associative"
A3[Address] -->|Any location| B3[All locations]
end
```
For a cache of total data capacity \(C\) and block size \(B\):

| Mapping Scheme | Sets | Ways per Set | Hit Latency | Conflict Misses | Hardware Complexity |
|---|---|---|---|---|---|
| Direct-mapped | \(C/B\) | 1 | Lowest | Highest | Lowest |
| \(N\)-way set-associative | \(C/(N \cdot B)\) | \(N\) | Medium | Medium | Medium |
| Fully-associative | 1 | \(C/B\) | Highest | Lowest | Highest |
Typical configurations:
- L1 Cache: 4-8 way set-associative (balance latency and miss rate)
- L2 Cache: 8-16 way set-associative
- TLB: Fully-associative (few entries, need lowest miss rate)
4.3 Replacement Policies
When a cache set is full, choose which line to evict:
| Policy | Principle | Pros and Cons |
|---|---|---|
| LRU | Replace the least recently used line | Good hit rate (near-optimal in practice), but hardware cost grows rapidly with associativity |
| Pseudo-LRU | Tree-based LRU approximation | Low-cost approximation |
| Random | Random selection | Simple; approaches LRU at high associativity |
| FIFO | First in, first out | Simple but may replace hot data |
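A sketch of exact LRU within one set using per-line age counters; this is the common teaching model, whereas real high-associativity hardware usually implements the tree-based pseudo-LRU approximation from the table above:

```c
#include <stdint.h>

#define WAYS 8

typedef struct {
    int      valid;
    uint64_t tag;
    uint32_t age;   /* 0 = most recently used */
} cache_line;

/* On a hit, age every valid line and reset the touched line's age. */
void lru_touch(cache_line set[WAYS], int hit_way) {
    for (int w = 0; w < WAYS; w++)
        if (set[w].valid)
            set[w].age++;
    set[hit_way].age = 0;
}

/* On a miss, prefer an invalid line; otherwise evict the oldest line. */
int lru_victim(const cache_line set[WAYS]) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;
        if (set[w].age > set[victim].age)
            victim = w;
    }
    return victim;
}
```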
5. Cache Miss Classification -- The 3C Model
| Type | Cause | Optimization |
|---|---|---|
| Compulsory miss | First access; the data has never been in the cache | Prefetching |
| Capacity miss | Total cache capacity insufficient for the working set | Increase cache size |
| Conflict miss | Multiple addresses map to the same set | Increase associativity; victim cache |
Fourth type -- Coherence miss: In multi-core environments, write operations from other cores invalidate local cache lines.
6. Write Policies
Write Hit
| Policy | Behavior | Pros and Cons |
|---|---|---|
| Write-Through | Write to both cache and next level simultaneously | Simple coherence, high bandwidth consumption |
| Write-Back | Write only to cache; write back dirty lines upon eviction | Reduced write traffic, requires dirty bit |
Write Miss
| Policy | Behavior |
|---|---|
| Write-Allocate | Fetch block into cache first, then write (pairs with write-back) |
| No-Write-Allocate | Write directly to next level (pairs with write-through) |
Modern processors almost universally use the Write-Back + Write-Allocate combination.
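A toy direct-mapped model, as a sketch, of how a write-back + write-allocate store proceeds: a write miss first evicts a dirty victim, then fetches the target block, and the store itself only marks the line dirty (sizes and layout are illustrative):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_BYTES 64
#define NUM_SETS    64          /* direct-mapped: one line per set */
#define MEM_BYTES   (64 * 1024) /* tiny simulated memory           */

typedef struct {
    int      valid, dirty;
    uint64_t tag;
    uint8_t  data[BLOCK_BYTES];
} cache_line;

static uint8_t    memory[MEM_BYTES];
static cache_line cache[NUM_SETS];

static uint64_t set_of(uint64_t a) { return (a / BLOCK_BYTES) % NUM_SETS; }
static uint64_t tag_of(uint64_t a) { return (a / BLOCK_BYTES) / NUM_SETS; }
static uint64_t base_of(const cache_line *l, uint64_t set) {
    return (l->tag * NUM_SETS + set) * BLOCK_BYTES;
}

/* Write-back + write-allocate store of one byte. */
static void cache_store_byte(uint64_t addr, uint8_t value) {
    cache_line *line = &cache[set_of(addr)];
    if (!line->valid || line->tag != tag_of(addr)) {            /* write miss */
        if (line->valid && line->dirty)                          /* evict dirty victim */
            memcpy(&memory[base_of(line, set_of(addr))], line->data, BLOCK_BYTES);
        memcpy(line->data, &memory[addr - addr % BLOCK_BYTES], BLOCK_BYTES);
        line->valid = 1;
        line->dirty = 0;
        line->tag   = tag_of(addr);                              /* write-allocate */
    }
    line->data[addr % BLOCK_BYTES] = value;
    line->dirty = 1;   /* memory is updated only on eviction (write-back) */
}

int main(void) {
    cache_store_byte(0x1234, 0xAB);
    printf("dirty=%d\n", cache[set_of(0x1234)].dirty);
    return 0;
}
```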
7. Cache Optimization Techniques
7.1 Reducing Miss Rate
- Increase block size: Exploit spatial locality (but overly large blocks raise the miss penalty and, for a fixed capacity, can increase conflict and capacity misses)
- Increase cache capacity: Reduce capacity misses (but increases area and latency)
- Increase associativity: Reduce conflict misses (2:1 rule of thumb: a direct-mapped cache of size \(N\) has roughly the same miss rate as a 2-way set-associative cache of size \(N/2\))
- Victim Cache: Small fully-associative cache storing recently evicted lines
7.2 Reducing Miss Penalty
- Multi-level caches: L1 (small, fast) -> L2 -> L3 (large, slow)
- Critical Word First: Fetch the missed word from memory first so the CPU can use it immediately; the rest of the block follows
- Early Restart: Fetch the block in order, but resume execution as soon as the needed word arrives
7.3 Reducing Hit Time
- Small, simple L1 Cache: Direct-mapped or low associativity
- Way Prediction: Predict which way will hit; compare that way first
- VIPT (Virtually Indexed, Physically Tagged): Index the cache with virtual-address bits while the TLB translates in parallel, then compare tags using the physical address
7.4 Prefetching
- Hardware prefetching: Detects access patterns (stride, stream) and automatically prefetches subsequent blocks
- Software prefetching: Compiler/programmer inserts prefetch instructions (`__builtin_prefetch`)
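A small example of the GCC/Clang builtin mentioned above; the prefetch distance `PF_DIST` is an illustrative value that must be tuned per machine:

```c
#include <stddef.h>

#define PF_DIST 16   /* elements ahead; tune per machine (illustrative value) */

/* Sum an array while hinting the hardware to fetch data PF_DIST
 * elements ahead of the current position. */
double sum_with_prefetch(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], /*rw=*/0, /*locality=*/1);
        sum += a[i];
    }
    return sum;
}
```

For a plain sequential scan like this, hardware stream prefetchers usually suffice; software prefetching pays off mostly for indirect or pointer-chasing access patterns that hardware cannot predict.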
8. Virtual Memory
8.1 Basic Concepts
Virtual memory provides each process with an independent, contiguous address space, managed cooperatively by the OS and hardware.
Core functions:
- Address isolation: Processes are independent of each other
- Memory protection: Permission control (read/write/execute)
- Demand paging: Only active pages are loaded into physical memory
- Sharing: Multiple processes can share physical pages (e.g., shared libraries)
8.2 Page Table
Virtual address -> physical address mapping: the virtual page number (VPN) indexes the page table to obtain a physical page number (PPN), while the page offset passes through unchanged:

\[
\text{PA} = \text{PPN} \times \text{PageSize} + \text{offset}
\]
Page Table Entry (PTE) contains:
| Field | Function |
|---|---|
| PPN | Physical page number |
| Valid | Whether in physical memory |
| Dirty | Whether modified |
| Reference | Whether accessed (used by replacement algorithms) |
| Protection | Read/write/execute permissions |
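As a sketch, a PTE can be modeled as a C bit-field; the widths and ordering below are illustrative and do not match any specific architecture's encoding:

```c
#include <stdint.h>

/* Illustrative 64-bit page table entry (not an exact x86-64/RISC-V layout). */
typedef struct {
    uint64_t valid      : 1;   /* page is present in physical memory        */
    uint64_t readable   : 1;   /* protection bits                           */
    uint64_t writable   : 1;
    uint64_t executable : 1;
    uint64_t reference  : 1;   /* set on access; used by replacement        */
    uint64_t dirty      : 1;   /* set on write; page needs write-back       */
    uint64_t ppn        : 40;  /* physical page number                      */
    uint64_t reserved   : 18;
} pte_t;

/* Translate a virtual address using one PTE, assuming 4 KB pages. */
static inline uint64_t phys_addr(pte_t pte, uint64_t vaddr) {
    return (pte.ppn << 12) | (vaddr & 0xFFF);   /* PPN * 4096 + page offset */
}
```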
8.3 Multi-Level Page Tables
A single-level page table would require far too much contiguous memory (a full 64-bit address space with 4 KB pages implies \(2^{52}\) PTEs).
Multi-level page tables allocate on demand:
- x86-64: 4-level page table (PML4 -> PDPT -> PD -> PT), 9 bits per level index
- RISC-V Sv48: 4-level page table
- Only allocates page table pages for used virtual address ranges
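A sketch of how a walker slices an x86-64-style 48-bit virtual address into the four 9-bit indices plus the 12-bit page offset (the address value is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

/* Index of a virtual address at a given level (0 = PT, 3 = PML4),
 * assuming 4 KB pages and 9 index bits per level. */
static unsigned vpn_index(uint64_t vaddr, int level) {
    return (vaddr >> (12 + 9 * level)) & 0x1FF;
}

int main(void) {
    uint64_t va = 0x00007f1234567890ULL;
    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=%llu\n",
           vpn_index(va, 3), vpn_index(va, 2),
           vpn_index(va, 1), vpn_index(va, 0),
           (unsigned long long)(va & 0xFFF));
    return 0;
}
```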
8.4 TLB (Translation Lookaside Buffer)
Page table lookups require multiple memory accesses (a 4-level table needs up to 4 accesses per translation). The TLB is a hardware cache of recent page-table translations, so the average translation cost is approximately

\[
t_{\text{translate}} \approx t_{\text{TLB}} + (1 - h) \cdot t_{\text{walk}}
\]

where \(h\) is the TLB hit rate and \(t_{\text{walk}}\) is the page-table walk latency.
Typical configurations:
| Level | Capacity | Associativity | Latency |
|---|---|---|---|
| L1 ITLB | 64-128 entries | Fully-associative | 1 cycle |
| L1 DTLB | 64-128 entries | Fully-associative | 1 cycle |
| L2 TLB | 1024-2048 entries | 4-12 way | ~7 cycles |
8.5 Page Replacement Algorithms
| Algorithm | Principle | Characteristics |
|---|---|---|
| Optimal (OPT/MIN) | Replace the page used farthest in the future | Theoretically optimal, not implementable |
| LRU | Replace least recently used page | Approximates OPT, high implementation cost |
| Clock (Approximate LRU) | Circular list + reference bit | Most commonly used in practice |
| FIFO | First in, first out | Simple but poor performance (Belady's anomaly) |
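A compact sketch of the Clock algorithm over a fixed pool of frames; in a real OS the reference bits are set by hardware on access, here they are just fields of the model:

```c
#include <stdbool.h>

#define NFRAMES 8

typedef struct {
    int  page;        /* resident page number (model only) */
    bool referenced;  /* set when the page is accessed      */
} frame_t;

static frame_t frames[NFRAMES];
static int hand = 0;   /* clock hand: next frame to inspect */

/* Pick a victim frame: sweep past recently used frames, clearing their
 * reference bits, and evict the first frame found with the bit clear. */
int clock_victim(void) {
    for (;;) {
        if (!frames[hand].referenced) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        frames[hand].referenced = false;  /* second chance */
        hand = (hand + 1) % NFRAMES;
    }
}
```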
Belady's Anomaly
The FIFO algorithm exhibits the counterintuitive phenomenon that increasing the number of page frames can increase the number of page faults; the classic reference string 1,2,3,4,1,2,5,1,2,3,4,5 causes 9 faults with 3 frames but 10 faults with 4 frames. LRU and OPT are stack algorithms and do not suffer from Belady's anomaly.
8.6 Huge Pages (Superpages)
- Regular pages: 4 KB -> limited TLB coverage
- Huge pages: 2 MB or 1 GB -> greatly reduce TLB misses
- Applicable scenarios: Large memory workloads (databases, virtualization, scientific computing)
- Linux: `mmap` with `MAP_HUGETLB`, or Transparent Huge Pages (THP)
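A minimal Linux sketch of requesting an explicit 2 MB huge-page mapping with `mmap` + `MAP_HUGETLB`; it fails unless huge pages have been reserved (e.g., via `/proc/sys/vm/nr_hugepages`):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

int main(void) {
    /* MAP_HUGETLB asks the kernel for an explicit huge-page mapping;
     * this fails with ENOMEM unless huge pages have been reserved. */
    void *p = mmap(NULL, HUGE_2MB, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    ((char *)p)[0] = 1;            /* touch the page to fault it in */
    munmap(p, HUGE_2MB);
    return 0;
}
```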
9. Storage Technology Trends
9.1 Emerging Memory Technologies
| Technology | Characteristics | Positioning |
|---|---|---|
| HBM (High Bandwidth Memory) | 3D stacking, ultra-high bandwidth | GPU/AI accelerator memory |
| CXL Memory | Memory pool expansion via CXL protocol | Data center memory disaggregation |
| Persistent Memory | Non-volatile, byte-addressable | New tier between DRAM and SSD |
9.2 Prefetcher Evolution
- Stride Prefetcher: Detects fixed-stride patterns
- Stream Prefetcher: Detects sequential access streams
- Irregular Prefetcher: Learns irregular access patterns
- ML-based Prefetcher: Uses machine learning models for prediction (research frontier)
Navigation
- Previous: Processor Microarchitecture
- Next: Parallel Architecture