Memory Management
1. Address Spaces
1.1 Logical Address vs Physical Address
| Concept | Meaning | Generated By |
|---|---|---|
| Logical address (virtual address) | Address from the CPU/program's perspective | CPU |
| Physical address | Actual address on the memory bus | MMU (after translation) |
The MMU (Memory Management Unit) translates logical addresses to physical addresses at runtime, completely transparent to programs.
1.2 Process Address Space Layout
Typical Linux process virtual address space (64-bit):
High addr ┌─────────────────┐
│ Kernel space │ Inaccessible to user
├─────────────────┤ 0xFFFF800000000000
│ Stack ↓ │ Grows downward
│ │
│ (free region) │
│ │
│ mmap region │ Shared libraries, mmap files
│ │
│ Heap ↑ │ Grows upward (malloc/new)
├─────────────────┤
│ BSS (uninit.) │
│ Data (init.) │
│ Text (code) │ Read-only + executable
Low addr └─────────────────┘ 0x0000000000400000
2. Paging
2.1 Basic Concepts
Divide the virtual address space into fixed-size pages and physical memory into frames of the same size:
The page table performs the VPN -> PFN mapping. The offset remains unchanged.
Common page sizes: 4 KB (\(2^{12}\) bytes); huge pages 2 MB / 1 GB.
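The VPN -> PFN translation can be sketched with a toy single-level page table (the mapping values below are made up for illustration, not from real hardware):

```python
PAGE_SIZE = 4096  # 4 KB pages -> 12-bit offset

# Toy page table: VPN -> PFN (illustrative values)
page_table = {0: 5, 1: 9, 2: 1}

def translate(vaddr: int) -> int:
    """Split a virtual address into (VPN, offset) and map VPN -> PFN."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    pfn = page_table[vpn]            # a missing key would model a page fault
    return pfn * PAGE_SIZE + offset  # the offset is unchanged by translation

print(translate(4100))  # VPN 1 -> PFN 9: 9*4096 + 4 = 36868
```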
2.2 Single-Level Page Table
Problem -- page table too large:
- 32-bit address space + 4 KB pages -> \(2^{20}\) PTEs -> 4 MB per process at 4 bytes each
- 64-bit address space -> unacceptable
2.3 Multi-Level Page Table
Allocate page table pages on demand; unused virtual address ranges do not have allocated page tables.
Example: the x86-64 four-level page table (48-bit virtual addresses):
Virtual address (48 bits):
┌──────┬──────┬──────┬──────┬────────┐
│ PML4 │ PDPT │ PD │ PT │ Offset │
│ 9bit │ 9bit │ 9bit │ 9bit │ 12bit │
└──┬───┴──┬───┴──┬───┴──┬───┴────────┘
│ │ │ │
▼ ▼ ▼ ▼
PML4 → PDPT → PD → PT → Physical frame
Advantages:
- Space-efficient -- only allocates page table pages for used address regions
- Covers the full 48-bit virtual address space even though most processes use only a tiny fraction of it
Cost:
- On a TLB miss, one address translation requires 4 extra memory accesses (one per level)
- This makes the TLB even more important
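The 9/9/9/9/12 split above can be reproduced by extracting each index field with shifts and masks (the sample address is constructed purely for illustration):

```python
def split_vaddr(vaddr: int):
    """Decompose a 48-bit virtual address into x86-64 page-walk indices."""
    offset = vaddr & 0xFFF          # low 12 bits: byte offset within the page
    pt     = (vaddr >> 12) & 0x1FF  # 9-bit PT index
    pd     = (vaddr >> 21) & 0x1FF  # 9-bit PD index
    pdpt   = (vaddr >> 30) & 0x1FF  # 9-bit PDPT index
    pml4   = (vaddr >> 39) & 0x1FF  # 9-bit PML4 index
    return pml4, pdpt, pd, pt, offset

# Build an address whose fields are 1, 2, 3, 4, 5 by construction:
addr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_vaddr(addr))  # (1, 2, 3, 4, 5)
```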
2.4 Inverted Page Table
- Indexed by physical frame number (rather than virtual page number)
- Only one table system-wide, size = number of physical frames
- Lookup requires hashing or linear search
- Used in IA-64, PowerPC
2.5 TLB (Translation Lookaside Buffer)
Hardware cache for page table translations (see Memory Hierarchy Design for details).
With a TLB, the average time per memory reference is approximately
\(t_{\text{avg}} = t_{\text{TLB}} + h \cdot t_{\text{mem}} + (1-h)(k+1)\,t_{\text{mem}}\)
where \(h\) is the TLB hit rate, \(t_{\text{TLB}}\) is the TLB access time, \(k\) is the number of page table levels, and \(t_{\text{mem}}\) is the memory access time.
Numerical Example
TLB hit rate = 99%, TLB access = 1 ns, memory access = 100 ns, 4-level page table:
With TLB: \(1 + 0.99 \times 100 + 0.01 \times 5 \times 100 = 105\) ns per access.
Without TLB: each memory access would cost \(5 \times 100 = 500\) ns -- the TLB reduces average access time by about 4.7x.
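The arithmetic in this example can be checked directly (numbers taken from the example above):

```python
def avg_access_ns(hit_rate, t_tlb, t_mem, levels):
    """Average time per memory reference with a TLB in front of a
    `levels`-deep page table walk."""
    hit  = t_tlb + t_mem                   # translation cached in the TLB
    miss = t_tlb + levels * t_mem + t_mem  # full page walk + data access
    return hit_rate * hit + (1 - hit_rate) * miss

print(avg_access_ns(0.99, 1, 100, 4))  # 105.0 ns vs. 500 ns without a TLB
```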
3. Segmentation
Divides the address space into semantically meaningful segments (code, data, stack segments, etc.), each with an independent base address and length.
| Paging vs Segmentation | Paging | Segmentation |
|---|---|---|
| Unit size | Fixed | Variable |
| External fragmentation | None | Present |
| Internal fragmentation | Present (last page) | None |
| Programmer visibility | Transparent | Visible (logical division) |
Modern approach: Segmentation + paging combined (x86 retains segmentation but sets segment bases to 0, effectively pure paging).
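Segment translation with base/limit checking can be sketched as follows (the segment numbers and bounds are made-up example values):

```python
# Illustrative segment table: segment number -> (base, limit)
segments = {0: (0x1000, 0x400),   # e.g. code segment
            1: (0x8000, 0x200)}   # e.g. data segment

def seg_translate(seg: int, offset: int) -> int:
    """Translate (segment, offset) to a physical address with a bounds check."""
    base, limit = segments[seg]
    if offset >= limit:
        raise MemoryError("segmentation fault: offset beyond segment limit")
    return base + offset

print(hex(seg_translate(1, 0x10)))  # 0x8010
```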
4. Page Replacement Algorithms
When physical memory is insufficient, pages must be swapped out to disk.
4.1 Optimal Algorithm (OPT / Belady's MIN)
Replace the page that will not be used for the longest time in the future.
- Theoretically optimal; serves as performance upper bound for other algorithms
- Not implementable (requires future knowledge)
4.2 FIFO
Replace the page that entered memory earliest.
Belady's Anomaly
FIFO can exhibit the counterintuitive phenomenon where increasing the number of page frames increases the page fault rate.
Example: Reference sequence 1,2,3,4,1,2,5,1,2,3,4,5 under FIFO:
- 3 frames: 9 page faults
- 4 frames: 10 page faults (more!)
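The anomaly is easy to reproduce with a few lines of simulation:

```python
from collections import deque

def fifo_faults(refs, nframes):
    """Count page faults under FIFO replacement."""
    frames, queue, faults = set(), deque(), 0
    for p in refs:
        if p not in frames:
            faults += 1
            if len(frames) == nframes:
                frames.discard(queue.popleft())  # evict the oldest arrival
            frames.add(p)
            queue.append(p)
    return faults

seq = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(seq, 3), fifo_faults(seq, 4))  # 9 10
```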
4.3 LRU (Least Recently Used)
Replace the page that has not been used for the longest time.
- Based on temporal locality: past access patterns predict the future
- Exact implementation requires maintaining access timestamps or a stack -> high hardware overhead
- A stack algorithm; does not suffer from Belady's anomaly
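An exact LRU simulation (an OrderedDict stands in for the hardware timestamps or stack) shows no anomaly on the same reference sequence:

```python
from collections import OrderedDict

def lru_faults(refs, nframes):
    """Count page faults under exact LRU replacement."""
    frames, faults = OrderedDict(), 0
    for p in refs:
        if p in frames:
            frames.move_to_end(p)           # mark as most recently used
        else:
            faults += 1
            if len(frames) == nframes:
                frames.popitem(last=False)  # evict least recently used
            frames[p] = True
    return faults

seq = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(lru_faults(seq, 3), lru_faults(seq, 4))  # 10 8: more frames, fewer faults
```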
4.4 Clock Algorithm (Approximate LRU)
The most commonly used replacement algorithm in actual operating systems:
Circular linked list with a clock hand scanning pages:
Check the reference bit at the hand:
- 1 -> clear it and advance the hand (give a second chance)
- 0 -> replace this page and advance the hand
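A minimal sketch of the basic clock algorithm (reference bits plus a sweeping hand):

```python
def clock_faults(refs, nframes):
    """Count page faults under clock (second-chance) replacement."""
    frames = []                # entries are [page, ref_bit]
    hand, faults = 0, 0
    for p in refs:
        for entry in frames:
            if entry[0] == p:
                entry[1] = 1                  # hit: set the reference bit
                break
        else:
            faults += 1
            if len(frames) < nframes:
                frames.append([p, 1])         # a free frame is available
            else:
                while frames[hand][1]:        # ref bit set: second chance
                    frames[hand][1] = 0
                    hand = (hand + 1) % nframes
                frames[hand] = [p, 1]         # ref bit clear: victim found
                hand = (hand + 1) % nframes
    return faults

print(clock_faults([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], 3))  # 9
```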
Enhanced Clock (considering dirty bit):
| (reference, dirty) | Priority | Meaning |
|---|---|---|
| (0, 0) | Best replacement | Not used and not modified |
| (0, 1) | Second best | Not used but modified (needs write-back) |
| (1, 0) | Worse | Recently used but not modified |
| (1, 1) | Worst | Recently used and modified |
4.5 Algorithm Comparison
| Algorithm | Fault Rate | Overhead | Belady's Anomaly | Practicality |
|---|---|---|---|---|
| OPT | Lowest | -- | No | Not implementable |
| LRU | Close to OPT | High | No | Expensive to implement exactly |
| Clock | Close to LRU | Low | No | Most commonly used |
| FIFO | Higher | Lowest | Yes | Simple scenarios |
5. Thrashing
5.1 Definition
When a process's working set exceeds available physical memory, frequent page faults cause the system to spend most of its time on page swapping, and CPU utilization drops sharply.
5.2 Working Set Model
Working set \(W(t, \Delta)\): The set of pages referenced by a process within time window \(\Delta\).
If total demand > physical memory -> thrashing occurs.
OS strategies:
- Monitor each process's working set size
- If total demand exceeds memory, suspend some processes (medium-term scheduling)
- Reactivate when conditions are met
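The working set over a reference string can be computed directly (here \(\Delta\) is measured in references rather than wall-clock time, a common simplification):

```python
def working_set(refs, t, delta):
    """Pages referenced in the window of the last `delta` references ending at t."""
    lo = max(0, t - delta + 1)
    return set(refs[lo:t + 1])

refs = [1, 2, 1, 3, 4, 2, 4]
print(working_set(refs, 5, 3))  # pages touched at times 3..5 -> {2, 3, 4}
```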
5.3 PFF (Page Fault Frequency)
A more practical thrashing detection method:
- Monitor each process's page fault rate
- Fault rate > upper threshold -> allocate more frames
- Fault rate < lower threshold -> reclaim some frames
- No frames available to allocate -> suspend process
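The PFF feedback loop reduces to a simple controller; the threshold values below are arbitrary placeholders, not numbers from any real kernel:

```python
def pff_adjust(nframes, fault_rate, high=0.10, low=0.02):
    """Grow the frame allocation when faulting too much, shrink when faulting rarely."""
    if fault_rate > high:
        return nframes + 1          # allocate one more frame
    if fault_rate < low:
        return max(1, nframes - 1)  # reclaim a frame
    return nframes                  # within the acceptable band

print(pff_adjust(8, 0.15), pff_adjust(8, 0.01), pff_adjust(8, 0.05))  # 9 7 8
```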
6. Memory Allocation
6.1 Kernel-Level Memory Allocation
Contiguous Memory Allocation
| Algorithm | Strategy | Characteristics |
|---|---|---|
| First Fit | Find the first sufficiently large free block | Fast; fragments accumulate at low addresses |
| Best Fit | Find the smallest sufficiently large free block | Reduces waste but slow search, creates tiny fragments |
| Worst Fit | Find the largest free block | Reduces tiny fragments but large blocks deplete quickly |
| Next Fit | Continue searching from last allocation point | More even distribution |
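First fit and best fit can pick different blocks from the same free list; a toy free list makes the difference concrete:

```python
free_list = [(0, 8), (100, 3), (200, 5)]  # (start address, length) pairs

def first_fit(blocks, size):
    for addr, length in blocks:
        if length >= size:
            return addr                       # first block big enough
    return None

def best_fit(blocks, size):
    fits = [(length, addr) for addr, length in blocks if length >= size]
    return min(fits)[1] if fits else None     # tightest fit wins

print(first_fit(free_list, 4), best_fit(free_list, 4))  # 0 200
```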
Buddy System
The foundation of Linux physical memory allocation:
- Block sizes are \(2^k\) pages (\(0 \leq k \leq \text{MAX\_ORDER}-1\))
- Allocation: Find the smallest \(2^k\) block satisfying the request; if too large, split in half
- Deallocation: Check if the buddy block is free; if so, merge into a \(2^{k+1}\) block
Advantage: Merge operations are simple and efficient (buddy addresses differ by only one bit)
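A buddy allocator over \(2^k\)-page blocks can be sketched in a few lines; the XOR in `release` is exactly the "differ by one bit" property:

```python
class Buddy:
    """Toy buddy allocator managing 2**total_order pages starting at page 0."""
    def __init__(self, total_order):
        self.total = total_order
        self.free = {k: set() for k in range(total_order + 1)}
        self.free[total_order].add(0)          # one maximal block initially

    def alloc(self, order):
        k = order
        while k <= self.total and not self.free[k]:
            k += 1                             # find the smallest free block
        if k > self.total:
            return None                        # out of memory
        addr = self.free[k].pop()
        while k > order:                       # block too large: split in half
            k -= 1
            self.free[k].add(addr + (1 << k))  # the upper half becomes free
        return addr

    def release(self, addr, order):
        while order < self.total:
            buddy = addr ^ (1 << order)        # buddy address differs in one bit
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)     # buddy is free: merge upward
            addr = min(addr, buddy)
            order += 1
        self.free[order].add(addr)

b = Buddy(4)               # 16 pages
page = b.alloc(0)          # splits the maximal block down to one page
b.release(page, 0)         # merges all the way back up
print(page, b.free[4])     # 0 {0}
```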
Slab Allocator
Linux kernel objects (task_struct, inode, etc.) are frequently allocated and freed. Slab provides object-level caching on top of Buddy:
- Pre-allocates pools of specific-size objects
- Released objects are retained in the cache for reuse
- Eliminates internal fragmentation, reduces initialization overhead
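The slab idea, an object cache sitting above the page allocator, reduces to keeping freed objects for reuse; a minimal sketch:

```python
class SlabCache:
    """Toy object cache: freed objects are retained and handed back on the next alloc."""
    def __init__(self, ctor):
        self.ctor = ctor       # constructor for new objects (the "slab" refill)
        self.freelist = []
    def alloc(self):
        return self.freelist.pop() if self.freelist else self.ctor()
    def release(self, obj):
        self.freelist.append(obj)  # kept for reuse, not returned to the OS

cache = SlabCache(dict)
a = cache.alloc()
cache.release(a)
assert cache.alloc() is a  # the freed object is reused, no new allocation
```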
6.2 User-Level Memory Allocation
malloc / free implementations (e.g., glibc ptmalloc, jemalloc, tcmalloc):
| Allocator | Characteristics |
|---|---|
| ptmalloc (glibc) | Multiple arenas reduce lock contention; bins categorized by size |
| jemalloc (FreeBSD/Firefox) | Thread-local caches, better multi-thread scalability |
| tcmalloc (Google) | Thread-local allocation for small objects, centralized management for large objects |
| mimalloc (Microsoft) | Extreme small-object performance |
7. Memory-Mapped I/O
mmap maps files or devices into a process's address space:
| Use | Description |
|---|---|
| File mapping | Map file contents as memory, loaded on demand (lazy loading) |
| Anonymous mapping | Allocate memory not backed by a file (used by malloc for large requests) |
| Shared mapping | Multiple processes share the same physical pages (IPC, shared libraries) |
| Private mapping | COW (Copy-on-Write), copy when written |
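Python's mmap module exposes the same system call; a small file-mapping demo (the file is a temporary file created just for this example):

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello, mmap")
    # Map the whole file read-only (length 0 = entire file);
    # pages are faulted in on demand rather than read() up front.
    m = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    head = m[:5]        # slicing reads through the mapping
    m.close()
    print(head)         # b'hello'
finally:
    os.close(fd)
    os.unlink(path)
```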
Copy-on-Write (COW)
On fork(), parent and child share physical pages (marked read-only):
- Parent or child attempts to write -> triggers page protection fault
- Kernel copies the page -> each has an independent copy
- If the child immediately calls exec() -> most pages are never copied (highly efficient)
8. Memory Protection and Security
| Mechanism | Function |
|---|---|
| Page table permission bits | Read/write/execute permission control |
| NX bit | Mark pages as non-executable (prevents code injection) |
| ASLR | Address Space Layout Randomization (makes ROP and other code-reuse attacks harder) |
| Guard Page | Stack overflow detection (inaccessible sentinel page) |
| SMEP/SMAP | Prevent kernel mode from executing/accessing user-space memory |
9. Performance Optimization Practices
| Problem | Optimization |
|---|---|
| Excessive TLB misses | Use huge pages, reduce address space fragmentation |
| High page fault rate | Add physical memory, optimize data locality |
| Memory fragmentation | Memory compaction, Slab allocator |
| NUMA latency | NUMA-aware memory allocation (numactl, mbind) |
| Excessive swapping | Add memory, adjust swappiness, use zswap compression |
Navigation
- Previous: Processes and Threads
- Next: File Systems and I/O
- Related: Memory Hierarchy Design