
File Systems and I/O

1. File Abstraction

1.1 The Nature of Files

A file is the persistent data abstraction provided by the operating system -- presented to users as a named sequence of bytes, hiding the physical details of the underlying storage media.

File metadata:

| Attribute | Content |
|---|---|
| Name | Human-readable identifier |
| Type | Regular file, directory, symbolic link, device file, etc. |
| Size | Number of bytes |
| Permissions | owner/group/others rwx |
| Timestamps | Creation time, modification time, access time |
| Owner | UID / GID |
| Location | Data block positions on disk |

1.2 File Operations

Basic system calls:

open()   → Open a file, returns file descriptor fd
read()   → Read data from fd
write()  → Write data to fd
lseek()  → Move file offset (random access)
close()  → Close fd
stat()   → Get metadata

File descriptor (fd): A per-process integer index pointing to an entry in the kernel's open file table.

Process fd table → System open file table (offset, access mode) → inode table (metadata + data location)

2. Directory Structure

2.1 The Nature of Directories

A directory is a special file whose contents are a mapping table of filename -> inode number.

2.2 Directory Organization

| Structure | Characteristics |
|---|---|
| Single-level directory | All files in one directory (naming conflicts) |
| Two-level directory | One directory per user |
| Tree directory | Hierarchical structure (modern standard) |
| Acyclic-graph directory | Supports hard links and symbolic links |

2.3 Path Resolution

Taking /home/user/file.txt as an example:

  1. Start from the root directory inode (inode 2 on ext-family filesystems)
  2. Read root directory data blocks, find home -> get inode number
  3. Read the inode of home, get data blocks
  4. Find user in the data blocks -> get inode number
  5. Repeat until finding the inode of file.txt

Each level requires reading the inode + reading directory data blocks -> deep paths have high access cost.

2.4 Links

| Type | Mechanism | Characteristics |
|---|---|---|
| Hard link | Multiple directory entries point to the same inode | Cannot cross filesystems; cannot link directories |
| Symbolic link | Special file whose content is the target path string | Can cross filesystems; can link directories |

Data blocks are freed only when the inode's link count reaches zero and no process still holds the file open.

3. Disk Space Allocation

3.1 Contiguous Allocation

Each file occupies a contiguous range of disk blocks.

| Advantage | Disadvantage |
|---|---|
| Both sequential and random access are fast | Severe external fragmentation |
| Simple implementation (start block + length) | File growth is difficult |

Suitable for read-only media (e.g., ISO 9660 for CD-ROM).

3.2 Linked Allocation

Each disk block contains a pointer to the next block.

| Advantage | Disadvantage |
|---|---|
| No external fragmentation | Random access is extremely slow (\(O(n)\) seeks) |
| Easy file growth | Pointers consume space |

FAT (File Allocation Table): Stores the linked-list pointers centrally in one table, which can be cached in memory, improving random access.

3.3 Indexed Allocation

Each file has an index block storing the addresses of all data blocks.

Unix inode multi-level indexing:

inode:
├── Direct pointers ×12       → Point directly to data blocks
├── Single indirect ×1        → Index block → data blocks
├── Double indirect ×1        → Index block → index block → data blocks
└── Triple indirect ×1        → Index block → index block → index block → data blocks

Assuming 4 KB block size, 4-byte pointers (1024 pointers per block):

| Level | Addressable Blocks | Addressable Size |
|---|---|---|
| Direct (12) | 12 | 48 KB |
| Single indirect | 1024 | 4 MB |
| Double indirect | \(1024^2\) | 4 GB |
| Triple indirect | \(1024^3\) | 4 TB |
\[ \text{Maximum file size} = (12 + 1024 + 1024^2 + 1024^3) \times 4\text{ KB} \approx 4\text{ TB} \]

Advantage: Small files are accessed quickly (direct pointers); large files are also supported.

3.4 Extent-Based Allocation

Modern file systems (ext4, XFS, Btrfs) use extents -- descriptors for contiguous block ranges (start block, length):

  • A single extent can describe a large contiguous region
  • Reduced metadata overhead
  • Improved sequential I/O performance

4. Free Space Management

| Method | Data Structure | Characteristics |
|---|---|---|
| Bitmap | One bit per block (0 = free, 1 = used) | Compact; search efficiency depends on algorithm |
| Free list | Free blocks linked in a list | Allocation is fast, but traversal is slow |
| Free block groups | Records contiguous free block ranges | Suitable for extent-based file systems |
| B+ tree | Keyed by free intervals | Used by XFS, Btrfs; lookup \(O(\log n)\) |

5. Modern File Systems

| File System | Features | Use Case |
|---|---|---|
| ext4 | Journaling, extents, delayed allocation | Linux default |
| XFS | B+ tree indexing, high concurrency, large file optimization | Big data, servers |
| Btrfs | COW, snapshots, checksums, RAID | Advanced Linux features |
| ZFS | Storage pools, end-to-end checksums, snapshots, dedup | FreeBSD, storage servers |
| NTFS | MFT, ACLs, encryption | Windows |
| APFS | COW, snapshots, encryption, space sharing | macOS/iOS |

Journaling File Systems

Prevent crash-induced file system inconsistency:

| Mode | Journal Contents | Performance | Safety |
|---|---|---|---|
| Journal (full) | Metadata + data | Slowest | Highest |
| Ordered | Metadata only (data written before journal) | Medium | High |
| Writeback | Metadata only (data and journal order undefined) | Fastest | Medium |

ext4 defaults to ordered mode.

COW (Copy-on-Write) File Systems

Core design of Btrfs and ZFS:

  • Never modify data in place -- always write to a new location
  • Natural support for snapshots (just preserve old pointers)
  • Natural data consistency (writes are atomic)
  • Cost: May cause fragmentation

6. RAID

RAID (Redundant Array of Independent Disks): Combines multiple disks to achieve high performance and/or high reliability.

| RAID Level | Structure | Capacity Utilization | Fault Tolerance | Read Perf. | Write Perf. |
|---|---|---|---|---|---|
| RAID 0 | Striping | 100% | None | \(N\times\) | \(N\times\) |
| RAID 1 | Mirroring | 50% | 1 disk failure | \(N\times\) | \(1\times\) |
| RAID 5 | Distributed parity | \((N-1)/N\) | 1 disk failure | \((N-1)\times\) | Write amplification (read-modify-write parity) |
| RAID 6 | Dual parity (P+Q) | \((N-2)/N\) | 2 disk failures | \((N-2)\times\) | More write amplification |
| RAID 10 | Mirror then stripe | 50% | 1 per mirror pair | \(N\times\) | \(N/2\times\) |
\[ \text{RAID 5 usable capacity} = (N - 1) \times \text{single disk capacity} \]

RAID Is Not Backup

RAID provides hardware fault tolerance but does not protect against human error, software bugs, or ransomware. Independent backups are always necessary.

7. I/O System

7.1 I/O Methods

| Method | Mechanism | CPU Involvement | Use Case |
|---|---|---|---|
| Programmed I/O (Polling) | CPU polls device status register | Continuously occupied | Low latency, simple devices |
| Interrupt-driven I/O | Device sends interrupt upon completion | On each interrupt (CPU still moves the data) | General purpose |
| DMA (Direct Memory Access) | DMA controller transfers directly between device and memory | Only at start and finish | Bulk data transfer |

DMA flow:

  1. CPU configures DMA controller (source address, destination address, transfer size)
  2. DMA controller performs data transfer (CPU can do other work)
  3. Transfer complete -> DMA controller sends interrupt to CPU

7.2 I/O Schedulers

Disk Scheduling (HDD)

Disk seek time is the primary latency source. Scheduling algorithms optimize head movement:

| Algorithm | Strategy | Characteristics |
|---|---|---|
| FCFS | Process in arrival order | Fair but large seek distances |
| SSTF | Select nearest request (Shortest Seek Time First) | May cause starvation |
| SCAN (Elevator) | Head sweeps one direction to the end, then reverses | Moderate wait times |
| C-SCAN (Circular Scan) | Sweep one direction to the end, jump back to start | More uniform wait times |
| LOOK / C-LOOK | Only go to the farthest request, not the disk boundary | Improved version of SCAN |

Modern Linux I/O Schedulers

  • mq-deadline: Guarantees requests complete within deadline (default for HDD)
  • BFQ (Budget Fair Queueing): Optimized for desktop interactive responsiveness
  • none (noop): Suitable for SSD (no seek optimization needed)

SSD Characteristics

| Feature | Difference from HDD |
|---|---|
| No seek latency | Random read/write performance far superior to HDD |
| Write amplification | Must erase before writing (erase block > write page) |
| Wear leveling | FTL controller distributes writes to extend lifespan |
| TRIM | Notifies SSD which blocks are no longer in use, optimizing GC |
| Parallel channels | Multi-channel/multi-die parallelism boosts bandwidth |
\[ \text{SSD Write Amplification Factor (WAF)} = \frac{\text{Actual data written to flash}}{\text{Data written by host}} \]

Ideal WAF = 1; actual is typically 2-10 (depends on workload and GC efficiency).

7.3 I/O Software Stack

User space:  Application → Standard library (stdio)
             ─────── System call boundary ───────
Kernel:      VFS (Virtual File System layer)
             → Specific file system (ext4/XFS/...)
             → Page Cache
             → Block I/O layer (I/O scheduling)
             → Device driver
             → Hardware controller
Hardware:    Disk / SSD / NVMe

7.4 Page Cache

Linux uses free memory as a page cache:

  • Read cache: Data read from disk stays in memory; subsequent reads hit the cache
  • Write cache: Writes go to Page Cache first (dirty pages), flushed to disk in the background
  • Read-ahead: Detects sequential read patterns and prefetches subsequent data
Read path:  App → Page Cache hit? → Yes: Return directly
                                   → No: Read from disk into Page Cache → Return

Write path: App → Write to Page Cache (mark dirty)
            → pdflush/writeback thread periodically flushes to disk
            → or fsync() forces flush

7.5 Zero-Copy

Traditional file sending: Disk -> Kernel buffer -> User buffer -> Socket buffer -> NIC (4 copies).

sendfile() system call: Disk -> Kernel buffer -> NIC (2 copies), bypassing user space.

mmap + write: Another approach to reduce copies.

8. Storage Interface Evolution

| Interface | Protocol | Bandwidth | Latency | Queue Depth |
|---|---|---|---|---|
| SATA | AHCI | ~600 MB/s | ~100 us | 32 |
| SAS | SCSI | ~2.4 GB/s | ~100 us | 256 |
| NVMe (PCIe 4.0) | NVMe | ~7 GB/s | ~10 us | 65535 |
| NVMe (PCIe 5.0) | NVMe | ~14 GB/s | ~10 us | 65535 |

NVMe advantages: Designed specifically for flash; multi-queue (independent queue per CPU core), low latency, high parallelism.
