Memory Hierarchy Design

1. Principle of Locality

The theoretical foundation of memory hierarchy design is the Principle of Locality:

| Type | Meaning | Exploitation |
| --- | --- | --- |
| Temporal locality | Recently accessed data is likely to be accessed again | Keep recent data in fast storage (caching) |
| Spatial locality | Data near the current address is likely to be accessed soon | Transfer data in blocks (cache lines) |
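
As a concrete illustration, the two C loops below compute the same sum over a matrix but differ sharply in how well they exploit spatial locality; the array size is an arbitrary assumption for the sketch:

```c
/* A minimal sketch of spatial locality (array size is hypothetical).
   Row-major traversal touches consecutive addresses, so each fetched
   cache line is fully used; column-major strides by N doubles and
   wastes most of every line. */
#include <stddef.h>

#define N 1024

double sum_row_major(const double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];          /* consecutive addresses: good spatial locality */
    return s;
}

double sum_col_major(const double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];          /* stride of N doubles: poor spatial locality */
    return s;
}
```

On most machines the row-major version is markedly faster, purely because each fetched cache line is fully consumed before being evicted.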

2. Memory Hierarchy Overview

```mermaid
graph TB
    A["Registers<br/>~1 ns | ~1 KB | SRAM"] --> B["L1 Cache<br/>~1-2 ns | 32-128 KB | SRAM"]
    B --> C["L2 Cache<br/>~3-10 ns | 256 KB-1 MB | SRAM"]
    C --> D["L3 Cache<br/>~10-30 ns | 4-64 MB | SRAM"]
    D --> E["Main Memory DRAM<br/>~50-100 ns | 8-128 GB | DRAM"]
    E --> F["SSD<br/>~10-100 us | 256 GB-8 TB | Flash"]
    F --> G["HDD<br/>~5-10 ms | 1-20 TB | Magnetic Disk"]

    style A fill:#e8f5e9
    style B fill:#c8e6c9
    style C fill:#a5d6a7
    style D fill:#81c784
    style E fill:#fff9c4
    style F fill:#ffe0b2
    style G fill:#ffccbc
```

Core design goal: present the program with memory that is nearly as fast as the fastest level, yet has roughly the capacity and per-byte cost of the slowest level.

3. Average Memory Access Time (AMAT)

\[ \text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty} \]

For multi-level caches:

\[ \text{AMAT} = t_{L1} + r_{L1} \times (t_{L2} + r_{L2} \times (t_{L3} + r_{L3} \times t_{\text{mem}})) \]

Numerical Example

With L1 hit time = 1 ns, L1 miss rate = 5%, miss penalty (main memory access) = 100 ns:

\[\text{AMAT} = 1 + 0.05 \times 100 = 1 + 5 = 6 \text{ ns}\]

If L1 miss rate drops from 5% to 2%:

\[\text{AMAT} = 1 + 0.02 \times 100 = 3 \text{ ns}\]

AMAT is halved -- demonstrating the enormous value of reducing the miss rate.
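
The same arithmetic is easy to script. The C sketch below evaluates the multi-level AMAT formula; the L1 numbers come from the worked example, while the L2/L3 latencies and miss rates are assumed purely for illustration:

```c
/* A small sketch that evaluates the multi-level AMAT formula above.
   L1 numbers are from the worked example; L2/L3 values are assumed. */
#include <stdio.h>

/* AMAT = t_L1 + r_L1 * (t_L2 + r_L2 * (t_L3 + r_L3 * t_mem)) */
static double amat3(double t_l1, double r_l1, double t_l2, double r_l2,
                    double t_l3, double r_l3, double t_mem) {
    return t_l1 + r_l1 * (t_l2 + r_l2 * (t_l3 + r_l3 * t_mem));
}

int main(void) {
    /* Two-level example from the text: 1 + 0.05 * 100 = 6 ns. */
    printf("two-level:   %.1f ns\n", 1.0 + 0.05 * 100.0);
    /* Assumed three-level hierarchy: L2 = 4 ns / 20%, L3 = 15 ns / 30%. */
    printf("three-level: %.2f ns\n",
           amat3(1.0, 0.05, 4.0, 0.20, 15.0, 0.30, 100.0));  /* 1.65 ns */
    return 0;
}
```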

4. Cache Fundamentals

4.1 Cache Line Structure

Each cache line contains:

[Valid Bit(1)] [Dirty Bit(1)] [Tag(t bits)] [Data Block(B bytes)]

Address decomposition (\(m\)-bit address, \(2^s\) sets, \(E\) ways per set, block size \(B = 2^b\) bytes):

\[ \underbrace{\text{Tag}}_{t = m - s - b} \quad \underbrace{\text{Set Index}}_{s} \quad \underbrace{\text{Block Offset}}_{b} \]
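
A minimal C sketch of this decomposition, assuming \(b = 6\) and \(s = 6\) (64-byte lines, 64 sets; the address is arbitrary):

```c
/* Splitting an address into tag / set index / block offset, as in the
   formula above. The concrete bit widths are assumed for illustration. */
#include <stdint.h>
#include <stdio.h>

#define B_BITS 6   /* block offset: 64-byte lines */
#define S_BITS 6   /* set index:    64 sets       */

int main(void) {
    uint64_t addr   = 0x7ffe12345678ULL;              /* arbitrary address */
    uint64_t offset =  addr & ((1ULL << B_BITS) - 1); /* low b bits        */
    uint64_t set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1);
    uint64_t tag    =  addr >> (B_BITS + S_BITS);     /* remaining t bits  */
    printf("tag=%#llx set=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)set,
           (unsigned long long)offset);
    return 0;
}
```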

4.2 Mapping Schemes

```mermaid
graph LR
    subgraph "Direct-Mapped"
        A1[Address] -->|Unique mapping| B1[1 location]
    end
    subgraph "Set-Associative"
        A2[Address] -->|Maps to set| B2[Choose 1 of N locations]
    end
    subgraph "Fully-Associative"
        A3[Address] -->|Any location| B3[All locations]
    end
```

Here \(C\) is the total cache capacity and \(B\) the block size.

| Mapping Scheme | Sets | Ways per Set | Hit Latency | Conflict Misses | Hardware Complexity |
| --- | --- | --- | --- | --- | --- |
| Direct-mapped | \(C/B\) | 1 | Lowest | Highest | Lowest |
| \(N\)-way set-associative | \(C/(N \cdot B)\) | \(N\) | Medium | Medium | Medium |
| Fully-associative | 1 | \(C/B\) | Highest | Lowest | Highest |

Typical configurations:

  • L1 Cache: 4-8 way set-associative (balance latency and miss rate)
  • L2 Cache: 8-16 way set-associative
  • TLB: Fully-associative (few entries, need lowest miss rate)

4.3 Replacement Policies

When a cache set is full, choose which line to evict:

| Policy | Principle | Pros and Cons |
| --- | --- | --- |
| LRU | Replace the least recently used line | Near-optimal, but hardware cost grows rapidly with associativity |
| Pseudo-LRU | Tree-based approximation of LRU | Low-cost, close to true LRU in practice |
| Random | Pick a line at random | Simple; approaches LRU performance at high associativity |
| FIFO | Replace the oldest line | Simple, but may evict hot data |
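
To make the Pseudo-LRU row concrete, here is a minimal C sketch of tree-based PLRU for a single 4-way set. The bit convention used (each tree bit points toward the less recently used half) is one common choice, not the only one:

```c
/* Tree-based pseudo-LRU for one 4-way set: three bits form a binary
   tree. On an access, the bits on the path to the touched way are set
   to point away from it; the victim is found by following the bits. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint8_t b0, b1, b2; } plru4_t;  /* root, left, right */

static void plru_touch(plru4_t *t, int way) {
    if (way < 2) { t->b0 = 1; t->b1 = (way == 0); } /* point away from way */
    else         { t->b0 = 0; t->b2 = (way == 2); }
}

static int plru_victim(const plru4_t *t) {
    if (t->b0 == 0) return t->b1 == 0 ? 0 : 1;  /* left half less recent  */
    else            return t->b2 == 0 ? 2 : 3;  /* right half less recent */
}

int main(void) {
    plru4_t t = {0, 0, 0};
    plru_touch(&t, 0); plru_touch(&t, 2); plru_touch(&t, 1);
    printf("victim = way %d\n", plru_victim(&t));  /* way 3: never touched */
    return 0;
}
```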

5. Cache Miss Classification -- The 3C Model

| Type | Cause | Optimization |
| --- | --- | --- |
| Compulsory miss | First access; the block has never been in the cache | Prefetching |
| Capacity miss | Total cache capacity insufficient for the working set | Increase cache size |
| Conflict miss | Multiple addresses map to the same set | Increase associativity; victim cache |

Fourth type -- Coherence miss: In multi-core environments, write operations from other cores invalidate local cache lines.

6. Write Policies

Write Hit

| Policy | Behavior | Pros and Cons |
| --- | --- | --- |
| Write-Through | Write to the cache and the next level simultaneously | Simple coherence; high write bandwidth |
| Write-Back | Write only to the cache; write dirty lines back on eviction | Less write traffic; requires a dirty bit |

Write Miss

| Policy | Behavior |
| --- | --- |
| Write-Allocate | Fetch the block into the cache first, then write (usually paired with write-back) |
| No-Write-Allocate | Write directly to the next level (usually paired with write-through) |

Modern processors almost universally use the Write-Back + Write-Allocate combination.
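
A minimal C sketch of that combination for a single cache line follows; the memory-access helpers are hypothetical stubs standing in for the real next memory level:

```c
/* Write-back + write-allocate handling for one cache line.
   mem_read_block/mem_write_block are hypothetical stubs. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid, dirty;
    uint64_t tag;
    uint8_t  data[64];
} line_t;

/* Hypothetical next-level interface (stubbed out for the sketch). */
static void mem_write_block(uint64_t tag, const uint8_t *d) { (void)tag; (void)d; }
static void mem_read_block(uint64_t tag, uint8_t *d)        { (void)tag; (void)d; }

void cache_write_byte(line_t *line, uint64_t tag, unsigned off, uint8_t byte) {
    if (!line->valid || line->tag != tag) {      /* write miss              */
        if (line->valid && line->dirty)          /* evict: write back dirty */
            mem_write_block(line->tag, line->data);
        mem_read_block(tag, line->data);         /* write-allocate: fetch   */
        line->tag   = tag;
        line->valid = true;
        line->dirty = false;
    }
    line->data[off] = byte;                      /* perform the write       */
    line->dirty = true;                          /* memory updated on evict */
}
```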

7. Cache Optimization Techniques

7.1 Reducing Miss Rate

  • Increase block size: Exploit spatial locality (but overly large blocks raise the miss penalty and can increase conflict and capacity misses)
  • Increase cache capacity: Reduce capacity misses (but increases area and latency)
  • Increase associativity: Reduce conflict misses (2:1 cache rule of thumb: a direct-mapped cache of size \(N\) has about the same miss rate as a 2-way set-associative cache of size \(N/2\))
  • Victim Cache: Small fully-associative cache storing recently evicted lines

7.2 Reducing Miss Penalty

  • Multi-level caches: L1 (small, fast) -> L2 -> L3 (large, slow)
  • Critical Word First: Fetch the requested word from memory first and forward it to the CPU immediately, then fill in the rest of the block
  • Early Restart: Fetch the block in normal order, but let the CPU resume as soon as the requested word arrives

7.3 Reducing Hit Time

  • Small, simple L1 Cache: Direct-mapped or low associativity
  • Way Prediction: Predict which way will hit; compare that way first
  • VIPT (Virtually Indexed, Physically Tagged): Index the cache with virtual-address bits while the TLB translates in parallel, then compare physical tags

7.4 Prefetching

  • Hardware prefetching: Detects access patterns (stride, stream) and automatically prefetches subsequent blocks
  • Software prefetching: Compiler/programmer inserts prefetch instructions (__builtin_prefetch)
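
For example, a GCC/Clang loop using `__builtin_prefetch` might look like the sketch below; the prefetch distance of 16 elements is an assumed tuning value that depends on memory latency and per-iteration cost:

```c
/* Software prefetching with GCC/Clang's __builtin_prefetch.
   Arguments: address, rw (0 = read), temporal locality (0-3). */
#include <stddef.h>

double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1); /* read, low reuse */
        s += a[i];
    }
    return s;
}
```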

8. Virtual Memory

8.1 Basic Concepts

Virtual memory provides each process with an independent, contiguous address space, managed cooperatively by the OS and hardware.

Core functions:

  • Address isolation: Processes are independent of each other
  • Memory protection: Permission control (read/write/execute)
  • Demand paging: Only active pages are loaded into physical memory
  • Sharing: Multiple processes can share physical pages (e.g., shared libraries)

8.2 Page Table

Virtual address -> physical address mapping:

\[ \text{VA} = \underbrace{\text{VPN}}_{\text{Virtual Page Number}} \,\Vert\, \underbrace{\text{Offset}}_{\text{Page Offset}} \]
\[ \text{PA} = \underbrace{\text{PPN}}_{\text{Physical Page Number}} \,\Vert\, \underbrace{\text{Offset}}_{\text{Page Offset}} \]

where \(\Vert\) denotes bit concatenation: translation replaces the VPN with the PPN while the page offset passes through unchanged.

Page Table Entry (PTE) contains:

| Field | Function |
| --- | --- |
| PPN | Physical page number |
| Valid | Whether the page is in physical memory |
| Dirty | Whether the page has been modified |
| Reference | Whether the page has been accessed (used by replacement algorithms) |
| Protection | Read/write/execute permissions |

8.3 Multi-Level Page Tables

A single-level page table would require far too much contiguous memory: a full 64-bit address space with 4 KB pages implies \(2^{64-12} = 2^{52}\) PTEs (about 32 PiB at 8 bytes per entry).

Multi-level page tables allocate on demand:

  • x86-64: 4-level page table (PML4 -> PDPT -> PD -> PT), 9 bits per level index
  • RISC-V Sv48: 4-level page table
  • Only allocates page table pages for used virtual address ranges
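
The index extraction for such a walk is simple bit slicing. Below is a minimal C sketch for the x86-64 layout described above (the example address is arbitrary):

```c
/* Splitting an x86-64 virtual address into four 9-bit page-table
   indices plus the 12-bit page offset (PML4 -> PDPT -> PD -> PT). */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t va   = 0x00007f1234567abcULL;    /* arbitrary canonical VA */
    unsigned off  =  va        & 0xfff;       /* bits 11:0  page offset */
    unsigned pt   = (va >> 12) & 0x1ff;       /* bits 20:12 PT index    */
    unsigned pd   = (va >> 21) & 0x1ff;       /* bits 29:21 PD index    */
    unsigned pdpt = (va >> 30) & 0x1ff;       /* bits 38:30 PDPT index  */
    unsigned pml4 = (va >> 39) & 0x1ff;       /* bits 47:39 PML4 index  */
    printf("PML4=%u PDPT=%u PD=%u PT=%u off=%#x\n", pml4, pdpt, pd, pt, off);
    return 0;
}
```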

8.4 TLB (Translation Lookaside Buffer)

Page table lookups require multiple memory accesses (4 levels = 4 accesses). The TLB is a hardware cache for page table translations:

\[ \text{Effective Access Time} = t_{\text{TLB}} \times h + (t_{\text{TLB}} + t_{\text{page walk}}) \times (1 - h) \]

where \(h\) is the TLB hit rate.
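
For example, with assumed values \(t_{\text{TLB}} = 1\) cycle, \(t_{\text{page walk}} = 100\) cycles, and \(h = 99\%\):

\[ \text{Effective Access Time} = 1 \times 0.99 + (1 + 100) \times 0.01 = 0.99 + 1.01 = 2.0 \text{ cycles} \]

Even a 1% TLB miss rate doubles the effective translation cost.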

Typical configurations:

| Level | Capacity | Associativity | Latency |
| --- | --- | --- | --- |
| L1 ITLB | 64-128 entries | Fully-associative | 1 cycle |
| L1 DTLB | 64-128 entries | Fully-associative | 1 cycle |
| L2 TLB | 1024-2048 entries | 4-12 way | ~7 cycles |

8.5 Page Replacement Algorithms

| Algorithm | Principle | Characteristics |
| --- | --- | --- |
| Optimal (OPT/MIN) | Replace the page that will be used farthest in the future | Theoretical optimum; not implementable |
| LRU | Replace the least recently used page | Approximates OPT; costly to implement exactly |
| Clock (approximate LRU) | Circular list + reference bit | Most commonly used in practice |
| FIFO | Replace the oldest page | Simple but poor performance (Belady's anomaly) |
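
A minimal C sketch of the Clock algorithm's eviction scan follows; the frame count and bookkeeping are simplified assumptions:

```c
/* Clock (second-chance) replacement over a fixed set of frames.
   Each frame's reference bit is set on access; the hand sweeps,
   clearing bits until it finds a 0 to evict. */
#include <stdbool.h>

#define NFRAMES 8

typedef struct { int page; bool ref; } frame_t;

static frame_t frames[NFRAMES];
static int hand = 0;

/* Returns the frame index chosen for eviction. */
int clock_evict(void) {
    for (;;) {
        if (!frames[hand].ref) {          /* second chance used up */
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        frames[hand].ref = false;         /* grant a second chance */
        hand = (hand + 1) % NFRAMES;
    }
}
```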

Belady's Anomaly

The FIFO algorithm exhibits the counterintuitive phenomenon where increasing the number of page frames can increase the page fault rate. LRU and OPT are stack algorithms and do not suffer from Belady's anomaly.

8.6 Huge Pages (Superpages)

  • Regular pages: 4 KB -> limited TLB coverage
  • Huge pages: 2 MB or 1 GB -> greatly reduce TLB misses
  • Applicable scenarios: Large memory workloads (databases, virtualization, scientific computing)
  • Linux: mmap + MAP_HUGETLB or Transparent Huge Pages (THP)
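
A minimal Linux sketch of the explicit huge-page path via `mmap` + `MAP_HUGETLB`; it assumes 2 MB huge pages have been reserved (e.g. through `/proc/sys/vm/nr_hugepages`) and fails gracefully otherwise:

```c
/* Requesting one explicit 2 MB huge page on Linux. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_SZ (2UL * 1024 * 1024)   /* one 2 MB huge page */

int main(void) {
    void *p = mmap(NULL, HUGE_SZ, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {            /* e.g. no huge pages reserved */
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    ((char *)p)[0] = 1;               /* touch: one TLB entry covers 2 MB */
    munmap(p, HUGE_SZ);
    return 0;
}
```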

9. Frontier Trends

9.1 Emerging Memory Technologies

| Technology | Characteristics | Positioning |
| --- | --- | --- |
| HBM (High Bandwidth Memory) | 3D-stacked DRAM, ultra-high bandwidth | GPU/AI accelerator memory |
| CXL Memory | Memory-pool expansion over the CXL protocol | Data-center memory disaggregation |
| Persistent Memory | Non-volatile, byte-addressable | New tier between DRAM and SSD |

9.2 Prefetcher Evolution

  • Stride Prefetcher: Detects fixed-stride patterns
  • Stream Prefetcher: Detects sequential access streams
  • Irregular Prefetcher: Learns irregular access patterns
  • ML-based Prefetcher: Uses machine learning models for prediction (research frontier)

