File Systems and I/O
1. File Abstraction
1.1 The Nature of Files
A file is the persistent data abstraction provided by the operating system -- presented to users as a named sequence of bytes, hiding the physical details of the underlying storage media.
File metadata:
| Attribute | Content |
|---|---|
| Name | Human-readable identifier |
| Type | Regular file, directory, symbolic link, device file, etc. |
| Size | Number of bytes |
| Permissions | owner/group/others rwx |
| Timestamps | Access time (atime), modification time (mtime), metadata change time (ctime); some file systems also record a creation (birth) time |
| Owner | UID / GID |
| Location | Data block positions on disk |
1.2 File Operations
Basic system calls:
open() → Open a file, returns file descriptor fd
read() → Read data from fd
write() → Write data to fd
lseek() → Move file offset (random access)
close() → Close fd
stat() → Get metadata
File descriptor (fd): A per-process integer index pointing to an entry in the kernel's open file table.
Process fd table → System open file table (offset, access mode) → inode table (metadata + data location)
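A minimal C sketch of the calls listed above, assuming a writable file named example.txt (the filename is hypothetical):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("example.txt", O_RDWR | O_CREAT, 0644); /* open() returns an fd */
    if (fd < 0) { perror("open"); return 1; }

    write(fd, "hello", 5);            /* write() advances the file offset */
    lseek(fd, 0, SEEK_SET);           /* lseek() moves the offset back to the start */

    char buf[16] = {0};
    read(fd, buf, sizeof(buf) - 1);   /* read() starts from the current offset */
    printf("read back: %s\n", buf);

    struct stat st;
    fstat(fd, &st);                   /* metadata: size, inode number, permissions */
    printf("size=%lld inode=%llu\n", (long long)st.st_size,
           (unsigned long long)st.st_ino);

    close(fd);                        /* close() releases the fd table entry */
    return 0;
}
```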
2. Directory Structure
2.1 The Nature of Directories
A directory is a special file whose contents are a mapping table of filename -> inode number.
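This mapping is visible from user space: readdir() returns each entry's name together with its inode number (d_ino). A small sketch, listing the current directory:

```c
#include <dirent.h>
#include <stdio.h>

int main(void) {
    DIR *dir = opendir(".");
    if (!dir) { perror("opendir"); return 1; }

    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL)       /* each entry is a (name, inode) pair */
        printf("%-20s -> inode %llu\n", ent->d_name,
               (unsigned long long)ent->d_ino);

    closedir(dir);
    return 0;
}
```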
2.2 Directory Organization
| Structure | Characteristics |
|---|---|
| Single-level directory | All files in one directory (naming conflicts) |
| Two-level directory | One directory per user |
| Tree directory | Hierarchical structure (modern standard) |
| Acyclic-graph directory | Supports hard links and symbolic links |
2.3 Path Resolution
Taking /home/user/file.txt as an example:
- Start from the root directory's inode (inode 2 in ext-family file systems)
- Read the root directory's data blocks and find the entry home -> get its inode number
- Read home's inode to get its data blocks
- Find user in those data blocks -> get its inode number
- Repeat until the inode of file.txt is found
Each level requires reading an inode plus the directory's data blocks, so deep paths have a high lookup cost (sketched below).
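A user-space approximation of this walk, calling stat() on each successive prefix of a hypothetical path and printing the inode found at each level:

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(void) {
    const char *path = "/home/user/file.txt";   /* hypothetical path */
    char prefix[256] = "/";
    struct stat st;

    if (stat("/", &st) == 0)
        printf("%-22s inode %llu\n", "/", (unsigned long long)st.st_ino);

    char copy[256];
    strncpy(copy, path, sizeof(copy) - 1);
    copy[sizeof(copy) - 1] = '\0';

    /* one lookup per path component, mirroring the kernel's per-level work */
    for (char *tok = strtok(copy, "/"); tok; tok = strtok(NULL, "/")) {
        strncat(prefix, tok, sizeof(prefix) - strlen(prefix) - 1);
        if (stat(prefix, &st) != 0) { perror(prefix); break; }
        printf("%-22s inode %llu\n", prefix, (unsigned long long)st.st_ino);
        strncat(prefix, "/", sizeof(prefix) - strlen(prefix) - 1);
    }
    return 0;
}
```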
2.4 Links
| Type | Mechanism | Characteristics |
|---|---|---|
| Hard link | Multiple directory entries point to the same inode | Cannot cross filesystems, cannot link directories |
| Symbolic link | Special file whose content is the target path string | Can cross filesystems, can link directories |
The inode's link count must reach zero before data blocks are freed.
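A short sketch, assuming an existing file named data.txt (hypothetical): create a hard link and a symbolic link, then inspect them with lstat() to see that the hard link shares the inode and raises its link count, while the symlink is a separate inode holding the path string.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static void show(const char *path) {
    struct stat st;
    if (lstat(path, &st) == 0)        /* lstat() does not follow symlinks */
        printf("%-12s inode=%llu links=%lu\n", path,
               (unsigned long long)st.st_ino, (unsigned long)st.st_nlink);
}

int main(void) {
    link("data.txt", "hard.txt");      /* new directory entry, same inode */
    symlink("data.txt", "soft.txt");   /* new inode whose data is the target path */
    show("data.txt");
    show("hard.txt");                  /* same inode, link count now 2 */
    show("soft.txt");                  /* different inode, link count 1 */
    return 0;
}
```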
3. Disk Space Allocation
3.1 Contiguous Allocation
Each file occupies a contiguous range of disk blocks.
| Advantage | Disadvantage |
|---|---|
| Both sequential and random access are fast | Severe external fragmentation |
| Simple implementation (start block + length) | File growth is difficult |
Suitable for read-only media (e.g., ISO 9660 for CD-ROM).
3.2 Linked Allocation
Each disk block contains a pointer to the next block.
| Advantage | Disadvantage |
|---|---|
| No external fragmentation | Random access is extremely slow (\(O(n)\) seeks) |
| Easy file growth | Pointers consume space |
FAT (File Allocation Table): Stores the linked-list pointers centrally in a table that can be cached in memory, so following a file's chain no longer requires reading every data block -- improving random access.
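A toy illustration of FAT-style chaining (the table contents are made up): fat[i] holds the block that follows block i, so walking a file touches only the table.

```c
#include <stdio.h>

#define FAT_EOF -1

int main(void) {
    /* fat[i] = next block after block i, or FAT_EOF at the end of the file */
    int fat[16] = {0};
    fat[3] = 7; fat[7] = 2; fat[2] = FAT_EOF;   /* file occupies blocks 3 -> 7 -> 2 */

    int start = 3;                               /* first block, from the directory entry */
    printf("file blocks:");
    for (int b = start; b != FAT_EOF; b = fat[b])
        printf(" %d", b);
    printf("\n");
    return 0;
}
```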
3.3 Indexed Allocation
Each file has an index block storing the addresses of all data blocks.
Unix inode multi-level indexing:
inode:
├── Direct pointers ×12 → Point directly to data blocks
├── Single indirect ×1 → Index block → data blocks
├── Double indirect ×1 → Index block → index block → data blocks
└── Triple indirect ×1 → Index block → index block → index block → data blocks
Assuming 4 KB block size, 4-byte pointers (1024 pointers per block):
| Level | Addressable Blocks | Addressable Size |
|---|---|---|
| Direct (12) | 12 | 48 KB |
| Single indirect | 1024 | 4 MB |
| Double indirect | \(1024^2\) | 4 GB |
| Triple indirect | \(1024^3\) | 4 TB |
Advantage: Small files are accessed quickly (direct pointers); large files are also supported.
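A small C check of the numbers in the table above, under the same assumptions (4 KB blocks, 4-byte pointers, hence 1024 pointers per index block):

```c
#include <stdio.h>

int main(void) {
    const unsigned long long block = 4096;          /* block size in bytes */
    const unsigned long long ptrs  = block / 4;     /* pointers per index block = 1024 */

    unsigned long long direct = 12;
    unsigned long long single = ptrs;               /* 1024 blocks */
    unsigned long long dbl    = ptrs * ptrs;        /* 1024^2 blocks */
    unsigned long long triple = ptrs * ptrs * ptrs; /* 1024^3 blocks */

    printf("direct : %llu blocks = %llu KB\n", direct, direct * block / 1024);
    printf("single : %llu blocks = %llu MB\n", single, single * block >> 20);
    printf("double : %llu blocks = %llu GB\n", dbl,    dbl * block >> 30);
    printf("triple : %llu blocks = %llu TB\n", triple, triple * block >> 40);
    return 0;
}
```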
3.4 Extent-Based Allocation
Modern file systems (ext4, XFS, Btrfs) use extents -- descriptors for contiguous block ranges (start block, length):
- A single extent can describe a large contiguous region
- Reduced metadata overhead
- Improved sequential I/O performance
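A minimal sketch of an extent descriptor (field names are illustrative, not the on-disk ext4 layout): one entry maps a whole contiguous run of file blocks to disk blocks, so a large sequential file needs only a handful of entries.

```c
#include <stdio.h>

struct extent {
    unsigned long long logical_start;   /* first logical block within the file */
    unsigned long long physical_start;  /* first block on disk */
    unsigned int       length;          /* number of contiguous blocks */
};

int main(void) {
    /* a 1 GB file of 4 KB blocks described by a single extent */
    struct extent e = { 0, 123456, 262144 };
    printf("file blocks [%llu..%llu) map to disk blocks [%llu..%llu)\n",
           e.logical_start, e.logical_start + e.length,
           e.physical_start, e.physical_start + e.length);
    return 0;
}
```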
4. Free Space Management
| Method | Data Structure | Characteristics |
|---|---|---|
| Bitmap | One bit per block (0=free, 1=used) | Compact; search efficiency depends on algorithm |
| Free list | Free blocks linked in a list | Allocation is fast, but traversal is slow |
| Free block groups | Records contiguous free block ranges | Suitable for extent-based file systems |
| B+ tree | Keyed by free intervals | Used by XFS, Btrfs; lookup \(O(\log n)\) |
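A first-fit bitmap allocator in miniature, as a sketch of the first row of the table (one bit per block, 0 = free, 1 = used):

```c
#include <stdio.h>

#define NBLOCKS 64

static unsigned char bitmap[NBLOCKS / 8];           /* 0 = free, 1 = used */

static int alloc_block(void) {
    for (int b = 0; b < NBLOCKS; b++)
        if (!(bitmap[b / 8] & (1 << (b % 8)))) {     /* first free bit found */
            bitmap[b / 8] |= (1 << (b % 8));
            return b;
        }
    return -1;                                       /* no free space */
}

static void free_block(int b) {
    bitmap[b / 8] &= ~(1 << (b % 8));                /* clear the bit */
}

int main(void) {
    int a = alloc_block(), b = alloc_block();
    printf("allocated blocks %d and %d\n", a, b);
    free_block(a);
    printf("after freeing %d, next allocation returns %d\n", a, alloc_block());
    return 0;
}
```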
5. Modern File Systems
| File System | Features | Use Case |
|---|---|---|
| ext4 | Journaling, extents, delayed allocation | Linux default |
| XFS | B+ tree indexing, high concurrency, large file optimization | Big data, servers |
| Btrfs | COW, snapshots, checksums, RAID | Advanced Linux features |
| ZFS | Storage pools, end-to-end checksums, snapshots, dedup | FreeBSD, storage servers |
| NTFS | MFT, ACLs, encryption | Windows |
| APFS | COW, snapshots, encryption, space sharing | macOS/iOS |
Journaling File Systems
Prevent crash-induced file system inconsistency:
| Mode | Journal Contents | Performance | Safety |
|---|---|---|---|
| Journal (full) | Metadata + data | Slowest | Highest |
| Ordered | Metadata only (data written before journal) | Medium | High |
| Writeback | Metadata only (data and journal order undefined) | Fastest | Medium |
ext4 defaults to ordered mode.
COW (Copy-on-Write) File Systems
Core design of Btrfs and ZFS:
- Never modify data in place -- always write to a new location
- Natural support for snapshots (just preserve old pointers)
- Natural data consistency (writes are atomic)
- Cost: May cause fragmentation
6. RAID
RAID (Redundant Array of Independent Disks): Combines multiple disks to achieve high performance and/or high reliability.
| RAID Level | Structure | Capacity Utilization | Fault Tolerance | Read Perf. | Write Perf. |
|---|---|---|---|---|---|
| RAID 0 | Striping | 100% | None | \(N\times\) | \(N\times\) |
| RAID 1 | Mirroring | 50% | 1 disk failure | \(N\times\) | \(1\times\) |
| RAID 5 | Distributed parity | \((N-1)/N\) | 1 disk failure | \((N-1)\times\) | Write amplification (read-modify-write parity) |
| RAID 6 | Dual parity (P+Q) | \((N-2)/N\) | 2 disk failures | \((N-2)\times\) | More write amplification |
| RAID 10 | Mirror then stripe | 50% | 1 per mirror pair | \(N\times\) | \(N/2\times\) |
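RAID 5's fault tolerance rests on XOR parity: P = D0 ⊕ D1 ⊕ ... ⊕ D(n-1), so a lost data block is recovered by XORing the parity with the surviving blocks. A toy stripe (data values made up):

```c
#include <stdio.h>

#define STRIPE 4   /* bytes per disk in this toy stripe */

int main(void) {
    unsigned char d0[STRIPE] = {1, 2, 3, 4};
    unsigned char d1[STRIPE] = {5, 6, 7, 8};
    unsigned char d2[STRIPE] = {9, 10, 11, 12};
    unsigned char p[STRIPE], rebuilt[STRIPE];

    for (int i = 0; i < STRIPE; i++)
        p[i] = d0[i] ^ d1[i] ^ d2[i];          /* compute parity */

    for (int i = 0; i < STRIPE; i++)           /* "disk 1 failed": rebuild it */
        rebuilt[i] = p[i] ^ d0[i] ^ d2[i];

    for (int i = 0; i < STRIPE; i++)
        printf("%u ", rebuilt[i]);             /* prints 5 6 7 8 */
    printf("\n");
    return 0;
}
```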
RAID Is Not Backup
RAID provides hardware fault tolerance but does not protect against human error, software bugs, or ransomware. Independent backups are always necessary.
7. I/O System
7.1 I/O Methods
| Method | Mechanism | CPU Involvement | Use Case |
|---|---|---|---|
| Programmed I/O (Polling) | CPU polls device status register | Continuously occupied | Low latency, simple devices |
| Interrupt-driven I/O | Device sends interrupt upon completion | Only at start and finish | General purpose |
| DMA (Direct Memory Access) | DMA controller transfers directly between device and memory | Only at start and finish | Bulk data transfer |
DMA flow:
- CPU configures DMA controller (source address, destination address, transfer size)
- DMA controller performs data transfer (CPU can do other work)
- Transfer complete -> DMA controller sends interrupt to CPU
7.2 I/O Schedulers
Disk Scheduling (HDD)
Disk seek time is the primary latency source. Scheduling algorithms optimize head movement:
| Algorithm | Strategy | Characteristics |
|---|---|---|
| FCFS | Process in arrival order | Fair but large seek distances |
| SSTF | Select nearest request (Shortest Seek Time First) | May cause starvation |
| SCAN (Elevator) | Head sweeps one direction to the end, then reverses | Moderate wait times |
| C-SCAN (Circular Scan) | Sweep one direction to the end, jump back to start | Fairer |
| LOOK / C-LOOK | Only go to the farthest request, not disk boundary | Improved version of SCAN |
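A sketch of the LOOK flavor of SCAN (it reverses at the farthest pending request rather than the disk boundary); the request queue and initial head position below are made up for illustration:

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    int req[] = {98, 183, 37, 122, 14, 124, 65, 67};   /* pending cylinder requests */
    int n = sizeof(req) / sizeof(req[0]);
    int head = 53;                                     /* current head position */

    qsort(req, n, sizeof(int), cmp);

    /* index of the first request at or above the head position */
    int split = 0;
    while (split < n && req[split] < head) split++;

    printf("service order (sweeping toward higher cylinders first):");
    for (int i = split; i < n; i++) printf(" %d", req[i]);      /* upward sweep */
    for (int i = split - 1; i >= 0; i--) printf(" %d", req[i]); /* then reverse */
    printf("\n");
    return 0;
}
```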
Modern Linux I/O Schedulers
- mq-deadline: Assigns expiration deadlines to requests to bound latency and prevent starvation (common default for HDDs)
- BFQ (Budget Fair Queueing): Optimized for desktop interactive responsiveness
- none (noop): Suitable for SSD (no seek optimization needed)
SSD Characteristics
| Feature | Difference from HDD |
|---|---|
| No seek latency | Random read/write performance far superior to HDD |
| Write amplification | Must erase before writing (erase block > write page) |
| Wear leveling | FTL controller distributes writes to extend lifespan |
| TRIM | Notifies SSD which blocks are no longer in use, optimizing GC |
| Parallel channels | Multi-channel/multi-die parallelism boosts bandwidth |
The ideal write amplification factor (WAF) is 1; in practice it is typically 2-10, depending on workload and GC efficiency.
7.3 I/O Software Stack
User space: Application → Standard library (stdio)
─────── System call boundary ───────
Kernel: VFS (Virtual File System layer)
→ Specific file system (ext4/XFS/...)
→ Page Cache
→ Block I/O layer (I/O scheduling)
→ Device driver
→ Hardware controller
Hardware: Disk / SSD / NVMe
7.4 Page Cache
Linux uses free memory as a page cache:
- Read cache: Data read from disk stays in memory; subsequent reads hit the cache
- Write cache: Writes go to Page Cache first (dirty pages), flushed to disk in the background
- Read-ahead: Detects sequential read patterns and prefetches subsequent data
Read path: App → Page Cache hit? → Yes: Return directly
→ No: Read from disk into Page Cache → Return
Write path: App → Write to Page Cache (mark dirty)
→ background writeback (flusher) kernel threads periodically flush dirty pages to disk
→ or fsync() forces flush
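A minimal sketch of this write path, assuming a hypothetical file log.txt: write() only dirties the page cache, and fsync() blocks until the data (and associated metadata) is durable on disk.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("log.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    write(fd, "record\n", 7);   /* lands in the page cache, page marked dirty */

    if (fsync(fd) != 0)         /* force the dirty pages to reach the disk */
        perror("fsync");

    close(fd);
    return 0;
}
```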
7.5 Zero-Copy
Traditional file sending: Disk -> Kernel buffer -> User buffer -> Socket buffer -> NIC (4 copies).
sendfile() system call: Disk -> Kernel buffer -> NIC, bypassing user space (2 copies when the NIC supports scatter-gather DMA, 3 otherwise).
mmap + write: Another approach to reduce copies.
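A sketch of sendfile() on Linux, assuming hypothetical filenames: data moves from the page cache of in_fd to out_fd without passing through a user-space buffer. The classic use is sending a file to a connected socket; to keep the sketch self-contained, out_fd here is a regular file, which recent Linux kernels also allow.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int in_fd  = open("input.dat", O_RDONLY);
    int out_fd = open("copy.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in_fd < 0 || out_fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(in_fd, &st);                 /* total number of bytes to transfer */

    off_t offset = 0;
    while (offset < st.st_size) {      /* kernel copies directly, no user buffer */
        ssize_t sent = sendfile(out_fd, in_fd, &offset, st.st_size - offset);
        if (sent <= 0) { perror("sendfile"); return 1; }
    }

    close(in_fd);
    close(out_fd);
    return 0;
}
```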
8. Storage Interface Evolution
| Interface | Protocol | Bandwidth | Latency | Queue Depth |
|---|---|---|---|---|
| SATA | AHCI | ~600 MB/s | ~100 us | 32 |
| SAS | SCSI | ~2.4 GB/s | ~100 us | 256 |
| NVMe (PCIe 4.0) | NVMe | ~7 GB/s | ~10 us | 65535 |
| NVMe (PCIe 5.0) | NVMe | ~14 GB/s | ~10 us | 65535 |
NVMe advantages: Designed specifically for flash; multi-queue (independent queue per CPU core), low latency, high parallelism.
Navigation
- Previous: Memory Management
- Related: Memory Hierarchy Design