Memory Management
1. Address Spaces
1.1 Logical Address vs Physical Address
| Concept | Meaning | Generated By |
|---|---|---|
| Logical address (virtual address) | Address from the CPU/program's perspective | CPU |
| Physical address | Actual address on the memory bus | MMU (after translation) |
The MMU (Memory Management Unit) translates logical addresses to physical addresses at runtime, completely transparent to programs.
1.2 Process Address Space Layout
Typical Linux process virtual address space (64-bit):
High addr ┌─────────────────┐
│ Kernel space │ Inaccessible to user
├─────────────────┤ 0xFFFF800000000000
│ Stack ↓ │ Grows downward
│ │
│ (free region) │
│ │
│ mmap region │ Shared libraries, mmap files
│ │
│ Heap ↑ │ Grows upward (malloc/new)
├─────────────────┤
│ BSS (uninit.) │
│ Data (init.) │
│ Text (code) │ Read-only + executable
Low addr └─────────────────┘ 0x0000000000400000
2. Paging
2.1 Basic Concepts
Divide the virtual address space into fixed-size pages and physical memory into frames of the same size:
The page table performs the VPN -> PFN mapping. The offset remains unchanged.
Common page sizes: 4 KB (\(2^{12}\) bytes); huge pages 2 MB / 1 GB.
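The VPN -> PFN translation can be sketched with a toy single-level page table (the mapping values below are made up for illustration, not from real hardware):

```python
PAGE_SIZE = 4096  # 4 KB pages -> 12-bit offset

# Toy page table: VPN -> PFN (illustrative values)
page_table = {0: 5, 1: 9, 2: 1}

def translate(vaddr: int) -> int:
    """Split a virtual address into (VPN, offset) and map VPN -> PFN."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    pfn = page_table[vpn]            # a missing key would model a page fault
    return pfn * PAGE_SIZE + offset  # the offset is unchanged by translation

print(translate(4100))  # VPN 1 -> PFN 9: 9*4096 + 4 = 36868
```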
2.2 Single-Level Page Table
Problem -- page table too large:
- 32-bit address space + 4 KB pages -> \(2^{20}\) PTEs -> 4 MB per process at 4 bytes each
- 64-bit address space -> unacceptable
2.3 Multi-Level Page Table
Allocate page table pages on demand; unused virtual address ranges do not have allocated page tables.
Example: the x86-64 four-level page table (48-bit virtual addresses):
Virtual address (48 bits):
┌──────┬──────┬──────┬──────┬────────┐
│ PML4 │ PDPT │ PD │ PT │ Offset │
│ 9bit │ 9bit │ 9bit │ 9bit │ 12bit │
└──┬───┴──┬───┴──┬───┴──┬───┴────────┘
│ │ │ │
▼ ▼ ▼ ▼
PML4 → PDPT → PD → PT → Physical frame
Advantages:
- Space-efficient -- only allocates page table pages for used address regions
- Covers the full 48-bit virtual address space even though most processes use only a tiny fraction of it
Cost:
- On a TLB miss, one address translation requires 4 extra memory accesses (one per level)
- This makes the TLB even more important
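The 9/9/9/9/12 split above can be reproduced by extracting each index field with shifts and masks (the sample address is constructed purely for illustration):

```python
def split_vaddr(vaddr: int):
    """Decompose a 48-bit virtual address into x86-64 page-walk indices."""
    offset = vaddr & 0xFFF          # low 12 bits: byte offset within the page
    pt     = (vaddr >> 12) & 0x1FF  # 9-bit PT index
    pd     = (vaddr >> 21) & 0x1FF  # 9-bit PD index
    pdpt   = (vaddr >> 30) & 0x1FF  # 9-bit PDPT index
    pml4   = (vaddr >> 39) & 0x1FF  # 9-bit PML4 index
    return pml4, pdpt, pd, pt, offset

# Build an address whose fields are 1, 2, 3, 4, 5 by construction:
addr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_vaddr(addr))  # (1, 2, 3, 4, 5)
```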
2.4 Inverted Page Table
- Indexed by physical frame number (rather than virtual page number)
- Only one table system-wide, size = number of physical frames
- Lookup requires hashing or linear search
- Used in IA-64, PowerPC
2.5 TLB (Translation Lookaside Buffer)
Hardware cache for page table translations (see Memory Hierarchy Design for details).
With a TLB, the average time per memory reference is approximately
\(t_{\text{avg}} = t_{\text{TLB}} + h \cdot t_{\text{mem}} + (1-h)(k+1)\,t_{\text{mem}}\)
where \(h\) is the TLB hit rate, \(t_{\text{TLB}}\) is the TLB access time, \(k\) is the number of page table levels, and \(t_{\text{mem}}\) is the memory access time.
Numerical Example
TLB hit rate = 99%, TLB access = 1 ns, memory access = 100 ns, 4-level page table:
With TLB: \(1 + 0.99 \times 100 + 0.01 \times 5 \times 100 = 105\) ns per access.
Without TLB: each memory access would cost \(5 \times 100 = 500\) ns -- the TLB reduces average access time by about 4.7x.
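The arithmetic in this example can be checked directly (numbers taken from the example above):

```python
def avg_access_ns(hit_rate, t_tlb, t_mem, levels):
    """Average time per memory reference with a TLB in front of a
    `levels`-deep page table walk."""
    hit  = t_tlb + t_mem                   # translation cached in the TLB
    miss = t_tlb + levels * t_mem + t_mem  # full page walk + data access
    return hit_rate * hit + (1 - hit_rate) * miss

print(avg_access_ns(0.99, 1, 100, 4))  # 105.0 ns vs. 500 ns without a TLB
```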
3. Segmentation
Divides the address space into semantically meaningful segments (code, data, stack segments, etc.), each with an independent base address and length.
| Paging vs Segmentation | Paging | Segmentation |
|---|---|---|
| Unit size | Fixed | Variable |
| External fragmentation | None | Present |
| Internal fragmentation | Present (last page) | None |
| Programmer visibility | Transparent | Visible (logical division) |
Modern approach: Segmentation + paging combined (x86 retains segmentation but sets segment bases to 0, effectively pure paging).
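Segment translation with base/limit checking can be sketched as follows (the segment numbers and bounds are made-up example values):

```python
# Illustrative segment table: segment number -> (base, limit)
segments = {0: (0x1000, 0x400),   # e.g. code segment
            1: (0x8000, 0x200)}   # e.g. data segment

def seg_translate(seg: int, offset: int) -> int:
    """Translate (segment, offset) to a physical address with a bounds check."""
    base, limit = segments[seg]
    if offset >= limit:
        raise MemoryError("segmentation fault: offset beyond segment limit")
    return base + offset

print(hex(seg_translate(1, 0x10)))  # 0x8010
```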
4. Page Replacement Algorithms
When physical memory is insufficient, pages must be swapped out to disk.
4.1 Optimal Algorithm (OPT / Belady's MIN)
Replace the page that will not be used for the longest time in the future.
- Theoretically optimal; serves as performance upper bound for other algorithms
- Not implementable (requires future knowledge)
4.2 FIFO
Replace the page that entered memory earliest.
Belady's Anomaly
FIFO can exhibit the counterintuitive phenomenon where increasing the number of page frames increases the page fault rate.
Example: Reference sequence 1,2,3,4,1,2,5,1,2,3,4,5 under FIFO:
- 3 frames: 9 page faults
- 4 frames: 10 page faults (more!)
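The anomaly is easy to reproduce with a few lines of simulation:

```python
from collections import deque

def fifo_faults(refs, nframes):
    """Count page faults under FIFO replacement."""
    frames, queue, faults = set(), deque(), 0
    for p in refs:
        if p not in frames:
            faults += 1
            if len(frames) == nframes:
                frames.discard(queue.popleft())  # evict the oldest arrival
            frames.add(p)
            queue.append(p)
    return faults

seq = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(seq, 3), fifo_faults(seq, 4))  # 9 10
```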
4.3 LRU (Least Recently Used)
Replace the page that has not been used for the longest time.
- Based on temporal locality: past access patterns predict the future
- Exact implementation requires maintaining access timestamps or a stack -> high hardware overhead
- A stack algorithm; does not suffer from Belady's anomaly
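An exact LRU simulation (an OrderedDict stands in for the hardware timestamps or stack) shows no anomaly on the same reference sequence:

```python
from collections import OrderedDict

def lru_faults(refs, nframes):
    """Count page faults under exact LRU replacement."""
    frames, faults = OrderedDict(), 0
    for p in refs:
        if p in frames:
            frames.move_to_end(p)           # mark as most recently used
        else:
            faults += 1
            if len(frames) == nframes:
                frames.popitem(last=False)  # evict least recently used
            frames[p] = True
    return faults

seq = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(lru_faults(seq, 3), lru_faults(seq, 4))  # 10 8: more frames, fewer faults
```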
4.4 Clock Algorithm (Approximate LRU)
The most commonly used replacement algorithm in actual operating systems:
Circular linked list with a clock hand scanning pages:
Check the reference bit at the hand:
- 1 -> clear it and advance the hand (give a second chance)
- 0 -> replace this page and advance the hand
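A minimal sketch of the basic clock algorithm (reference bits plus a sweeping hand):

```python
def clock_faults(refs, nframes):
    """Count page faults under clock (second-chance) replacement."""
    frames = []                # entries are [page, ref_bit]
    hand, faults = 0, 0
    for p in refs:
        for entry in frames:
            if entry[0] == p:
                entry[1] = 1                  # hit: set the reference bit
                break
        else:
            faults += 1
            if len(frames) < nframes:
                frames.append([p, 1])         # a free frame is available
            else:
                while frames[hand][1]:        # ref bit set: second chance
                    frames[hand][1] = 0
                    hand = (hand + 1) % nframes
                frames[hand] = [p, 1]         # ref bit clear: victim found
                hand = (hand + 1) % nframes
    return faults

print(clock_faults([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], 3))  # 9
```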
Enhanced Clock (considering dirty bit):
| (reference, dirty) | Priority | Meaning |
|---|---|---|
| (0, 0) | Best replacement | Not used and not modified |
| (0, 1) | Second best | Not used but modified (needs write-back) |
| (1, 0) | Worse | Recently used but not modified |
| (1, 1) | Worst | Recently used and modified |
4.5 Algorithm Comparison
| Algorithm | Fault Rate | Overhead | Belady's Anomaly | Practicality |
|---|---|---|---|---|
| OPT | Lowest | -- | No | Not implementable |
| LRU | Close to OPT | High | No | Expensive to implement exactly |
| Clock | Close to LRU | Low | No | Most commonly used |
| FIFO | Higher | Lowest | Yes | Simple scenarios |
5. Thrashing
5.1 Definition
When a process's working set exceeds available physical memory, frequent page faults cause the system to spend most of its time on page swapping, and CPU utilization drops sharply.
5.2 Working Set Model
Working set \(W(t, \Delta)\): The set of pages referenced by a process within time window \(\Delta\).
If total demand > physical memory -> thrashing occurs.
OS strategies:
- Monitor each process's working set size
- If total demand exceeds memory, suspend some processes (medium-term scheduling)
- Reactivate when conditions are met
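The working set over a reference string can be computed directly (here \(\Delta\) is measured in references rather than wall-clock time, a common simplification):

```python
def working_set(refs, t, delta):
    """Pages referenced in the window of the last `delta` references ending at t."""
    lo = max(0, t - delta + 1)
    return set(refs[lo:t + 1])

refs = [1, 2, 1, 3, 4, 2, 4]
print(working_set(refs, 5, 3))  # pages touched at times 3..5 -> {2, 3, 4}
```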
5.3 PFF (Page Fault Frequency)
A more practical thrashing detection method:
- Monitor each process's page fault rate
- Fault rate > upper threshold -> allocate more frames
- Fault rate < lower threshold -> reclaim some frames
- No frames available to allocate -> suspend process
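The PFF feedback loop reduces to a simple controller; the threshold values below are arbitrary placeholders, not numbers from any real kernel:

```python
def pff_adjust(nframes, fault_rate, high=0.10, low=0.02):
    """Grow the frame allocation when faulting too much, shrink when faulting rarely."""
    if fault_rate > high:
        return nframes + 1          # allocate one more frame
    if fault_rate < low:
        return max(1, nframes - 1)  # reclaim a frame
    return nframes                  # within the acceptable band

print(pff_adjust(8, 0.15), pff_adjust(8, 0.01), pff_adjust(8, 0.05))  # 9 7 8
```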
6. Memory Allocation
6.1 Kernel-Level Memory Allocation
Contiguous Memory Allocation
| Algorithm | Strategy | Characteristics |
|---|---|---|
| First Fit | Find the first sufficiently large free block | Fast; fragments accumulate at low addresses |
| Best Fit | Find the smallest sufficiently large free block | Reduces waste but slow search, creates tiny fragments |
| Worst Fit | Find the largest free block | Reduces tiny fragments but large blocks deplete quickly |
| Next Fit | Continue searching from last allocation point | More even distribution |
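First fit and best fit can pick different blocks from the same free list; a toy free list makes the difference concrete:

```python
free_list = [(0, 8), (100, 3), (200, 5)]  # (start address, length) pairs

def first_fit(blocks, size):
    for addr, length in blocks:
        if length >= size:
            return addr                       # first block big enough
    return None

def best_fit(blocks, size):
    fits = [(length, addr) for addr, length in blocks if length >= size]
    return min(fits)[1] if fits else None     # tightest fit wins

print(first_fit(free_list, 4), best_fit(free_list, 4))  # 0 200
```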
Buddy System
The foundation of Linux physical memory allocation:
- Block sizes are \(2^k\) pages (\(0 \leq k \leq \text{MAX\_ORDER}-1\))
- Allocation: Find the smallest \(2^k\) block satisfying the request; if too large, split in half
- Deallocation: Check if the buddy block is free; if so, merge into a \(2^{k+1}\) block
Advantage: Merge operations are simple and efficient (buddy addresses differ by only one bit)
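A buddy allocator over \(2^k\)-page blocks can be sketched in a few lines; the XOR in `release` is exactly the "differ by one bit" property:

```python
class Buddy:
    """Toy buddy allocator managing 2**total_order pages starting at page 0."""
    def __init__(self, total_order):
        self.total = total_order
        self.free = {k: set() for k in range(total_order + 1)}
        self.free[total_order].add(0)          # one maximal block initially

    def alloc(self, order):
        k = order
        while k <= self.total and not self.free[k]:
            k += 1                             # find the smallest free block
        if k > self.total:
            return None                        # out of memory
        addr = self.free[k].pop()
        while k > order:                       # block too large: split in half
            k -= 1
            self.free[k].add(addr + (1 << k))  # the upper half becomes free
        return addr

    def release(self, addr, order):
        while order < self.total:
            buddy = addr ^ (1 << order)        # buddy address differs in one bit
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)     # buddy is free: merge upward
            addr = min(addr, buddy)
            order += 1
        self.free[order].add(addr)

b = Buddy(4)               # 16 pages
page = b.alloc(0)          # splits the maximal block down to one page
b.release(page, 0)         # merges all the way back up
print(page, b.free[4])     # 0 {0}
```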
Slab Allocator
Linux kernel objects (task_struct, inode, etc.) are frequently allocated and freed. Slab provides object-level caching on top of Buddy:
- Pre-allocates pools of specific-size objects
- Released objects are retained in the cache for reuse
- Eliminates internal fragmentation, reduces initialization overhead
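The slab idea, an object cache sitting above the page allocator, reduces to keeping freed objects for reuse; a minimal sketch:

```python
class SlabCache:
    """Toy object cache: freed objects are retained and handed back on the next alloc."""
    def __init__(self, ctor):
        self.ctor = ctor       # constructor for new objects (the "slab" refill)
        self.freelist = []
    def alloc(self):
        return self.freelist.pop() if self.freelist else self.ctor()
    def release(self, obj):
        self.freelist.append(obj)  # kept for reuse, not returned to the OS

cache = SlabCache(dict)
a = cache.alloc()
cache.release(a)
assert cache.alloc() is a  # the freed object is reused, no new allocation
```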
6.2 User-Level Memory Allocation
malloc / free implementations (e.g., glibc ptmalloc, jemalloc, tcmalloc):
| Allocator | Characteristics |
|---|---|
| ptmalloc (glibc) | Multiple arenas reduce lock contention; bins categorized by size |
| jemalloc (FreeBSD/Firefox) | Thread-local caches, better multi-thread scalability |
| tcmalloc (Google) | Thread-local allocation for small objects, centralized management for large objects |
| mimalloc (Microsoft) | Extreme small-object performance |
7. Memory-Mapped I/O
mmap maps files or devices into a process's address space:
| Use | Description |
|---|---|
| File mapping | Map file contents as memory, loaded on demand (lazy loading) |
| Anonymous mapping | Allocate memory not backed by a file (used by malloc for large requests) |
| Shared mapping | Multiple processes share the same physical pages (IPC, shared libraries) |
| Private mapping | COW (Copy-on-Write), copy when written |
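Python's mmap module exposes the same system call; a small file-mapping demo (the file is a temporary file created just for this example):

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello, mmap")
    # Map the whole file read-only (length 0 = entire file);
    # pages are faulted in on demand rather than read() up front.
    m = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    head = m[:5]        # slicing reads through the mapping
    m.close()
    print(head)         # b'hello'
finally:
    os.close(fd)
    os.unlink(path)
```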
Copy-on-Write (COW)
On fork(), parent and child share physical pages (marked read-only):
- Parent or child attempts to write -> triggers page protection fault
- Kernel copies the page -> each has an independent copy
- If the child immediately calls exec() -> most pages are never copied (highly efficient)
8. Memory Protection and Security
| Mechanism | Function |
|---|---|
| Page table permission bits | Read/write/execute permission control |
| NX bit | Mark pages as non-executable (prevents code injection) |
| ASLR | Address Space Layout Randomization (makes ROP and other code-reuse attacks harder) |
| Guard Page | Stack overflow detection (inaccessible sentinel page) |
| SMEP/SMAP | Prevent kernel mode from executing/accessing user-space memory |
9. Performance Optimization Practices
| Problem | Optimization |
|---|---|
| Excessive TLB misses | Use huge pages, reduce address space fragmentation |
| High page fault rate | Add physical memory, optimize data locality |
| Memory fragmentation | Memory compaction, Slab allocator |
| NUMA latency | NUMA-aware memory allocation (numactl, mbind) |
| Excessive swapping | Add memory, adjust swappiness, use zswap compression |
Navigation
- Previous: Processes and Threads
- Next: File Systems and I/O
- Related: Memory Hierarchy Design