Memory Hierarchy Design
1. Principle of Locality
The theoretical foundation of memory hierarchy design is the Principle of Locality:
| Type | Meaning | Exploitation |
|---|---|---|
| Temporal locality | Recently accessed data is likely to be accessed again | Keep recent data in fast storage (caching) |
| Spatial locality | Data near the current address is likely to be accessed | Transfer data in blocks (cache lines) |
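As a concrete illustration, the C sketch below (array layout and sizes are illustrative) sums a row-major matrix two ways: the natural loop order reuses `sum` on every iteration (temporal locality) and walks memory with stride 1 (spatial locality), while the swapped loop order strides by a whole row and wastes most of each fetched cache line.

```c
#include <stddef.h>

/* Sums a matrix stored in row-major order.
 * Temporal locality: `sum` is reused on every iteration.
 * Spatial locality: a[i][j] and a[i][j+1] are adjacent in memory,
 * so one fetched cache line serves several consecutive iterations. */
double sum_row_major(size_t rows, size_t cols, const double a[rows][cols]) {
    double sum = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)   /* stride-1 accesses */
            sum += a[i][j];
    return sum;
}

/* Swapping the loops (column-major traversal of a row-major array)
 * performs the same arithmetic but strides by `cols` doubles per access,
 * wasting most of each cache line and typically running much slower. */
double sum_col_major(size_t rows, size_t cols, const double a[rows][cols]) {
    double sum = 0.0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)   /* stride = cols doubles */
            sum += a[i][j];
    return sum;
}
```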
2. Memory Hierarchy Overview
```mermaid
graph TB
A["Registers<br/>~1 ns | ~1 KB | SRAM"] --> B["L1 Cache<br/>~1-2 ns | 32-128 KB | SRAM"]
B --> C["L2 Cache<br/>~3-10 ns | 256 KB-1 MB | SRAM"]
C --> D["L3 Cache<br/>~10-30 ns | 4-64 MB | SRAM"]
D --> E["Main Memory DRAM<br/>~50-100 ns | 8-128 GB | DRAM"]
E --> F["SSD<br/>~10-100 us | 256 GB-8 TB | Flash"]
F --> G["HDD<br/>~5-10 ms | 1-20 TB | Magnetic Disk"]
style A fill:#e8f5e9
style B fill:#c8e6c9
style C fill:#a5d6a7
style D fill:#81c784
style E fill:#fff9c4
style F fill:#ffe0b2
style G fill:#ffccbc
```
Core design goal: run programs at speeds close to those of the fastest level, while offering the capacity of the largest level at a cost per byte close to that of the cheapest (slowest) level.
3. Average Memory Access Time (AMAT)
The basic formula is

\[
\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
\]

For multi-level caches, the miss penalty of each level is the AMAT of the level below it:

\[
\text{AMAT} = t_{L1} + m_{L1} \times \bigl( t_{L2} + m_{L2} \times \text{Miss penalty}_{L2} \bigr)
\]
Numerical Example
With L1 hit time = 1 ns, L1 miss rate = 5%, and miss penalty (main memory access) = 100 ns:

\[
\text{AMAT} = 1 + 0.05 \times 100 = 6 \text{ ns}
\]

If the L1 miss rate drops from 5% to 2%:

\[
\text{AMAT} = 1 + 0.02 \times 100 = 3 \text{ ns}
\]

AMAT is halved, demonstrating the enormous value of reducing the miss rate.
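A minimal helper, as a sketch, extending the same formula to two cache levels (all numbers in the example are illustrative):

```c
#include <stdio.h>

/* AMAT for a two-level cache: the L1 miss penalty is itself the
 * AMAT of the L2 level (hit time + miss rate * memory latency). */
static double amat_two_level(double t_l1, double m_l1,
                             double t_l2, double m_l2,
                             double t_mem) {
    return t_l1 + m_l1 * (t_l2 + m_l2 * t_mem);
}

int main(void) {
    /* Illustrative numbers: 1 ns L1, 5% L1 misses, 10 ns L2,
     * 20% L2 misses, 100 ns main memory. */
    printf("AMAT = %.2f ns\n", amat_two_level(1.0, 0.05, 10.0, 0.20, 100.0));
    return 0;
}
```

With these numbers the L1 miss penalty is the L2 AMAT (10 + 0.2 × 100 = 30 ns), giving an overall AMAT of 1 + 0.05 × 30 = 2.5 ns.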
4. Cache Fundamentals
4.1 Cache Line Structure
Each cache line contains:
[Valid Bit(1)] [Dirty Bit(1)] [Tag(t bits)] [Data Block(B bytes)]
Address decomposition (\(m\)-bit address, \(2^s\) sets, \(E\) ways per set, block size \(B = 2^b\) bytes):

\[
\underbrace{\text{Tag}}_{m - s - b \text{ bits}} \;\big|\; \underbrace{\text{Set index}}_{s \text{ bits}} \;\big|\; \underbrace{\text{Block offset}}_{b \text{ bits}}
\]
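In a cache simulator these fields are typically extracted with shifts and masks; a small sketch (type and function names are illustrative):

```c
#include <stdint.h>

/* Decompose an address into tag | set index | block offset,
 * given 2^s sets and 2^b-byte blocks. */
typedef struct {
    uint64_t tag;     /* upper m - s - b bits */
    uint64_t set;     /* middle s bits        */
    uint64_t offset;  /* lower  b bits        */
} addr_fields;

addr_fields split_address(uint64_t addr, unsigned s, unsigned b) {
    addr_fields f;
    f.offset = addr & ((1ULL << b) - 1);
    f.set    = (addr >> b) & ((1ULL << s) - 1);
    f.tag    = addr >> (s + b);
    return f;
}
```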
4.2 Mapping Schemes
```mermaid
graph LR
subgraph "Direct-Mapped"
A1[Address] -->|Unique mapping| B1[1 location]
end
subgraph "Set-Associative"
A2[Address] -->|Maps to set| B2[Choose 1 of N locations]
end
subgraph "Fully-Associative"
A3[Address] -->|Any location| B3[All locations]
end
```
For a cache of total data capacity \(C\) and block size \(B\):

| Mapping Scheme | Sets | Ways per Set | Hit Latency | Conflict Misses | Hardware Complexity |
|---|---|---|---|---|---|
| Direct-mapped | \(C/B\) | 1 | Lowest | Highest | Lowest |
| \(N\)-way set-associative | \(C/(N \cdot B)\) | \(N\) | Medium | Medium | Medium |
| Fully-associative | 1 | \(C/B\) | Highest | Lowest | Highest |
Typical configurations:
- L1 Cache: 4-8 way set-associative (balance latency and miss rate)
- L2 Cache: 8-16 way set-associative
- TLB: Fully-associative (few entries, need lowest miss rate)
4.3 Replacement Policies
When a cache set is full, choose which line to evict:
| Policy | Principle | Pros and Cons |
|---|---|---|
| LRU | Replace the least recently used line | Good hit rate (near-optimal in practice), but hardware cost grows rapidly with associativity |
| Pseudo-LRU | Tree-based LRU approximation | Low-cost approximation |
| Random | Random selection | Simple; approaches LRU at high associativity |
| FIFO | First in, first out | Simple but may replace hot data |
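A sketch of exact LRU within one set using per-line age counters; this is the common teaching model, whereas real high-associativity hardware usually implements the tree-based pseudo-LRU approximation from the table above:

```c
#include <stdint.h>

#define WAYS 8

typedef struct {
    int      valid;
    uint64_t tag;
    uint32_t age;   /* 0 = most recently used */
} cache_line;

/* On a hit, age every valid line and reset the touched line's age. */
void lru_touch(cache_line set[WAYS], int hit_way) {
    for (int w = 0; w < WAYS; w++)
        if (set[w].valid)
            set[w].age++;
    set[hit_way].age = 0;
}

/* On a miss, prefer an invalid line; otherwise evict the oldest line. */
int lru_victim(const cache_line set[WAYS]) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;
        if (set[w].age > set[victim].age)
            victim = w;
    }
    return victim;
}
```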
5. Cache Miss Classification -- The 3C Model
| Type | Cause | Optimization |
|---|---|---|
| Compulsory miss | First access; the data has never been in the cache | Prefetching |
| Capacity miss | Total cache capacity insufficient for the working set | Increase cache size |
| Conflict miss | Multiple addresses map to the same set | Increase associativity; victim cache |
Fourth type -- Coherence miss: In multi-core environments, write operations from other cores invalidate local cache lines.
6. Write Policies
Write Hit
| Policy | Behavior | Pros and Cons |
|---|---|---|
| Write-Through | Write to both cache and next level simultaneously | Simple coherence, high bandwidth consumption |
| Write-Back | Write only to cache; write back dirty lines upon eviction | Reduced write traffic, requires dirty bit |
Write Miss
| Policy | Behavior |
|---|---|
| Write-Allocate | Fetch block into cache first, then write (pairs with write-back) |
| No-Write-Allocate | Write directly to next level (pairs with write-through) |
Modern processors almost universally use the Write-Back + Write-Allocate combination.
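A toy direct-mapped model, as a sketch, of how a write-back + write-allocate store proceeds: a write miss first evicts a dirty victim, then fetches the target block, and the store itself only marks the line dirty (sizes and layout are illustrative):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_BYTES 64
#define NUM_SETS    64          /* direct-mapped: one line per set */
#define MEM_BYTES   (64 * 1024) /* tiny simulated memory           */

typedef struct {
    int      valid, dirty;
    uint64_t tag;
    uint8_t  data[BLOCK_BYTES];
} cache_line;

static uint8_t    memory[MEM_BYTES];
static cache_line cache[NUM_SETS];

static uint64_t set_of(uint64_t a) { return (a / BLOCK_BYTES) % NUM_SETS; }
static uint64_t tag_of(uint64_t a) { return (a / BLOCK_BYTES) / NUM_SETS; }
static uint64_t base_of(const cache_line *l, uint64_t set) {
    return (l->tag * NUM_SETS + set) * BLOCK_BYTES;
}

/* Write-back + write-allocate store of one byte. */
static void cache_store_byte(uint64_t addr, uint8_t value) {
    cache_line *line = &cache[set_of(addr)];
    if (!line->valid || line->tag != tag_of(addr)) {            /* write miss */
        if (line->valid && line->dirty)                          /* evict dirty victim */
            memcpy(&memory[base_of(line, set_of(addr))], line->data, BLOCK_BYTES);
        memcpy(line->data, &memory[addr - addr % BLOCK_BYTES], BLOCK_BYTES);
        line->valid = 1;
        line->dirty = 0;
        line->tag   = tag_of(addr);                              /* write-allocate */
    }
    line->data[addr % BLOCK_BYTES] = value;
    line->dirty = 1;   /* memory is updated only on eviction (write-back) */
}

int main(void) {
    cache_store_byte(0x1234, 0xAB);
    printf("dirty=%d\n", cache[set_of(0x1234)].dirty);
    return 0;
}
```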
7. Cache Optimization Techniques
7.1 Reducing Miss Rate
- Increase block size: Exploit spatial locality (but overly large blocks raise the miss penalty and, for a fixed capacity, can increase conflict and capacity misses)
- Increase cache capacity: Reduce capacity misses (but increases area and latency)
- Increase associativity: Reduce conflict misses (2:1 rule of thumb: a direct-mapped cache of size \(N\) has roughly the same miss rate as a 2-way set-associative cache of size \(N/2\))
- Victim Cache: Small fully-associative cache storing recently evicted lines
7.2 Reducing Miss Penalty
- Multi-level caches: L1 (small, fast) -> L2 -> L3 (large, slow)
- Critical Word First: Fetch the missed word from memory first so the CPU can use it immediately; the rest of the block follows
- Early Restart: Fetch the block in order, but resume execution as soon as the needed word arrives
7.3 Reducing Hit Time
- Small, simple L1 Cache: Direct-mapped or low associativity
- Way Prediction: Predict which way will hit; compare that way first
- VIPT (Virtually Indexed, Physically Tagged): Index the cache with virtual-address bits while the TLB translates in parallel, then compare tags using the physical address
7.4 Prefetching
- Hardware prefetching: Detects access patterns (stride, stream) and automatically prefetches subsequent blocks
- Software prefetching: Compiler/programmer inserts prefetch instructions (`__builtin_prefetch`)
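A small example of the GCC/Clang builtin mentioned above; the prefetch distance `PF_DIST` is an illustrative value that must be tuned per machine:

```c
#include <stddef.h>

#define PF_DIST 16   /* elements ahead; tune per machine (illustrative value) */

/* Sum an array while hinting the hardware to fetch data PF_DIST
 * elements ahead of the current position. */
double sum_with_prefetch(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], /*rw=*/0, /*locality=*/1);
        sum += a[i];
    }
    return sum;
}
```

For a plain sequential scan like this, hardware stream prefetchers usually suffice; software prefetching pays off mostly for indirect or pointer-chasing access patterns that hardware cannot predict.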
8. Virtual Memory
8.1 Basic Concepts
Virtual memory provides each process with an independent, contiguous address space, managed cooperatively by the OS and hardware.
Core functions:
- Address isolation: Processes are independent of each other
- Memory protection: Permission control (read/write/execute)
- Demand paging: Only active pages are loaded into physical memory
- Sharing: Multiple processes can share physical pages (e.g., shared libraries)
8.2 Page Table
Virtual address -> physical address mapping: the virtual page number (VPN) indexes the page table to obtain a physical page number (PPN), while the page offset passes through unchanged:

\[
\text{PA} = \text{PPN} \times \text{PageSize} + \text{offset}
\]
Page Table Entry (PTE) contains:
| Field | Function |
|---|---|
| PPN | Physical page number |
| Valid | Whether in physical memory |
| Dirty | Whether modified |
| Reference | Whether accessed (used by replacement algorithms) |
| Protection | Read/write/execute permissions |
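As a sketch, a PTE can be modeled as a C bit-field; the widths and ordering below are illustrative and do not match any specific architecture's encoding:

```c
#include <stdint.h>

/* Illustrative 64-bit page table entry (not an exact x86-64/RISC-V layout). */
typedef struct {
    uint64_t valid      : 1;   /* page is present in physical memory        */
    uint64_t readable   : 1;   /* protection bits                           */
    uint64_t writable   : 1;
    uint64_t executable : 1;
    uint64_t reference  : 1;   /* set on access; used by replacement        */
    uint64_t dirty      : 1;   /* set on write; page needs write-back       */
    uint64_t ppn        : 40;  /* physical page number                      */
    uint64_t reserved   : 18;
} pte_t;

/* Translate a virtual address using one PTE, assuming 4 KB pages. */
static inline uint64_t phys_addr(pte_t pte, uint64_t vaddr) {
    return (pte.ppn << 12) | (vaddr & 0xFFF);   /* PPN * 4096 + page offset */
}
```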
8.3 Multi-Level Page Tables
A single-level page table would require far too much contiguous memory (a full 64-bit address space with 4 KB pages implies \(2^{52}\) PTEs).
Multi-level page tables allocate on demand:
- x86-64: 4-level page table (PML4 -> PDPT -> PD -> PT), 9 bits per level index
- RISC-V Sv48: 4-level page table
- Only allocates page table pages for used virtual address ranges
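A sketch of how a walker slices an x86-64-style 48-bit virtual address into the four 9-bit indices plus the 12-bit page offset (the address value is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

/* Index of a virtual address at a given level (0 = PT, 3 = PML4),
 * assuming 4 KB pages and 9 index bits per level. */
static unsigned vpn_index(uint64_t vaddr, int level) {
    return (vaddr >> (12 + 9 * level)) & 0x1FF;
}

int main(void) {
    uint64_t va = 0x00007f1234567890ULL;
    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=%llu\n",
           vpn_index(va, 3), vpn_index(va, 2),
           vpn_index(va, 1), vpn_index(va, 0),
           (unsigned long long)(va & 0xFFF));
    return 0;
}
```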
8.4 TLB (Translation Lookaside Buffer)
Page table lookups require multiple memory accesses (a 4-level table needs up to 4 accesses per translation). The TLB is a hardware cache of recent page-table translations, so the average translation cost is approximately

\[
t_{\text{translate}} \approx t_{\text{TLB}} + (1 - h) \cdot t_{\text{walk}}
\]

where \(h\) is the TLB hit rate and \(t_{\text{walk}}\) is the page-table walk latency.
Typical configurations:
| Level | Capacity | Associativity | Latency |
|---|---|---|---|
| L1 ITLB | 64-128 entries | Fully-associative | 1 cycle |
| L1 DTLB | 64-128 entries | Fully-associative | 1 cycle |
| L2 TLB | 1024-2048 entries | 4-12 way | ~7 cycles |
8.5 Page Replacement Algorithms
| Algorithm | Principle | Characteristics |
|---|---|---|
| Optimal (OPT/MIN) | Replace the page used farthest in the future | Theoretically optimal, not implementable |
| LRU | Replace least recently used page | Approximates OPT, high implementation cost |
| Clock (Approximate LRU) | Circular list + reference bit | Most commonly used in practice |
| FIFO | First in, first out | Simple but poor performance (Belady's anomaly) |
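A compact sketch of the Clock algorithm over a fixed pool of frames; in a real OS the reference bits are set by hardware on access, here they are just fields of the model:

```c
#include <stdbool.h>

#define NFRAMES 8

typedef struct {
    int  page;        /* resident page number (model only) */
    bool referenced;  /* set when the page is accessed      */
} frame_t;

static frame_t frames[NFRAMES];
static int hand = 0;   /* clock hand: next frame to inspect */

/* Pick a victim frame: sweep past recently used frames, clearing their
 * reference bits, and evict the first frame found with the bit clear. */
int clock_victim(void) {
    for (;;) {
        if (!frames[hand].referenced) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        frames[hand].referenced = false;  /* second chance */
        hand = (hand + 1) % NFRAMES;
    }
}
```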
Belady's Anomaly
The FIFO algorithm exhibits the counterintuitive phenomenon that increasing the number of page frames can increase the number of page faults; the classic reference string 1,2,3,4,1,2,5,1,2,3,4,5 causes 9 faults with 3 frames but 10 faults with 4 frames. LRU and OPT are stack algorithms and do not suffer from Belady's anomaly.
8.6 Huge Pages (Superpages)
- Regular pages: 4 KB -> limited TLB coverage
- Huge pages: 2 MB or 1 GB -> greatly reduce TLB misses
- Applicable scenarios: Large memory workloads (databases, virtualization, scientific computing)
- Linux: `mmap` with `MAP_HUGETLB`, or Transparent Huge Pages (THP)
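A minimal Linux sketch of requesting an explicit 2 MB huge-page mapping with `mmap` + `MAP_HUGETLB`; it fails unless huge pages have been reserved (e.g., via `/proc/sys/vm/nr_hugepages`):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

int main(void) {
    /* MAP_HUGETLB asks the kernel for an explicit huge-page mapping;
     * this fails with ENOMEM unless huge pages have been reserved. */
    void *p = mmap(NULL, HUGE_2MB, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    ((char *)p)[0] = 1;            /* touch the page to fault it in */
    munmap(p, HUGE_2MB);
    return 0;
}
```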
9. Storage Technology Trends
9.1 Emerging Memory Technologies
| Technology | Characteristics | Positioning |
|---|---|---|
| HBM (High Bandwidth Memory) | 3D stacking, ultra-high bandwidth | GPU/AI accelerator memory |
| CXL Memory | Memory pool expansion via CXL protocol | Data center memory disaggregation |
| Persistent Memory | Non-volatile, byte-addressable | New tier between DRAM and SSD |
9.2 Prefetcher Evolution
- Stride Prefetcher: Detects fixed-stride patterns
- Stream Prefetcher: Detects sequential access streams
- Irregular Prefetcher: Learns irregular access patterns
- ML-based Prefetcher: Uses machine learning models for prediction (research frontier)
Navigation
- Previous: Processor Microarchitecture
- Next: Parallel Architecture