File Systems and I/O
1. File Abstraction
1.1 The Nature of Files
A file is the persistent data abstraction provided by the operating system -- presented to users as a named sequence of bytes, hiding the physical details of the underlying storage media.
File metadata:
| Attribute | Content |
|---|---|
| Name | Human-readable identifier |
| Type | Regular file, directory, symbolic link, device file, etc. |
| Size | Number of bytes |
| Permissions | owner/group/others rwx |
| Timestamps | Access time (atime), modification time (mtime), metadata change time (ctime); some file systems also record a creation (birth) time |
| Owner | UID / GID |
| Location | Data block positions on disk |
1.2 File Operations
Basic system calls:
open() → Open a file, returns file descriptor fd
read() → Read data from fd
write() → Write data to fd
lseek() → Move file offset (random access)
close() → Close fd
stat() → Get metadata
File descriptor (fd): A per-process integer index pointing to an entry in the kernel's open file table.
Process fd table → System open file table (offset, access mode) → inode table (metadata + data location)
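A minimal C sketch of the calls listed above, assuming a writable file named example.txt (the filename is hypothetical):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("example.txt", O_RDWR | O_CREAT, 0644); /* open() returns an fd */
    if (fd < 0) { perror("open"); return 1; }

    write(fd, "hello", 5);            /* write() advances the file offset */
    lseek(fd, 0, SEEK_SET);           /* lseek() moves the offset back to the start */

    char buf[16] = {0};
    read(fd, buf, sizeof(buf) - 1);   /* read() starts from the current offset */
    printf("read back: %s\n", buf);

    struct stat st;
    fstat(fd, &st);                   /* metadata: size, inode number, permissions */
    printf("size=%lld inode=%llu\n", (long long)st.st_size,
           (unsigned long long)st.st_ino);

    close(fd);                        /* close() releases the fd table entry */
    return 0;
}
```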
2. Directory Structure
2.1 The Nature of Directories
A directory is a special file whose contents are a mapping table of filename -> inode number.
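This mapping is visible from user space: readdir() returns each entry's name together with its inode number (d_ino). A small sketch, listing the current directory:

```c
#include <dirent.h>
#include <stdio.h>

int main(void) {
    DIR *dir = opendir(".");
    if (!dir) { perror("opendir"); return 1; }

    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL)       /* each entry is a (name, inode) pair */
        printf("%-20s -> inode %llu\n", ent->d_name,
               (unsigned long long)ent->d_ino);

    closedir(dir);
    return 0;
}
```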
2.2 Directory Organization
| Structure | Characteristics |
|---|---|
| Single-level directory | All files in one directory (naming conflicts) |
| Two-level directory | One directory per user |
| Tree directory | Hierarchical structure (modern standard) |
| Acyclic-graph directory | Supports hard links and symbolic links |
2.3 Path Resolution
Taking /home/user/file.txt as an example:
- Start from the root directory's inode (inode 2 in ext-family file systems)
- Read the root directory's data blocks and find the entry home -> get its inode number
- Read home's inode to get its data blocks
- Find user in those data blocks -> get its inode number
- Repeat until the inode of file.txt is found
Each level requires reading an inode plus the directory's data blocks, so deep paths have a high lookup cost (sketched below).
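A user-space approximation of this walk, calling stat() on each successive prefix of a hypothetical path and printing the inode found at each level:

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(void) {
    const char *path = "/home/user/file.txt";   /* hypothetical path */
    char prefix[256] = "/";
    struct stat st;

    if (stat("/", &st) == 0)
        printf("%-22s inode %llu\n", "/", (unsigned long long)st.st_ino);

    char copy[256];
    strncpy(copy, path, sizeof(copy) - 1);
    copy[sizeof(copy) - 1] = '\0';

    /* one lookup per path component, mirroring the kernel's per-level work */
    for (char *tok = strtok(copy, "/"); tok; tok = strtok(NULL, "/")) {
        strncat(prefix, tok, sizeof(prefix) - strlen(prefix) - 1);
        if (stat(prefix, &st) != 0) { perror(prefix); break; }
        printf("%-22s inode %llu\n", prefix, (unsigned long long)st.st_ino);
        strncat(prefix, "/", sizeof(prefix) - strlen(prefix) - 1);
    }
    return 0;
}
```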
2.4 Links
| Type | Mechanism | Characteristics |
|---|---|---|
| Hard link | Multiple directory entries point to the same inode | Cannot cross filesystems, cannot link directories |
| Symbolic link | Special file whose content is the target path string | Can cross filesystems, can link directories |
The inode's link count must reach zero before data blocks are freed.
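A short sketch, assuming an existing file named data.txt (hypothetical): create a hard link and a symbolic link, then inspect them with lstat() to see that the hard link shares the inode and raises its link count, while the symlink is a separate inode holding the path string.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static void show(const char *path) {
    struct stat st;
    if (lstat(path, &st) == 0)        /* lstat() does not follow symlinks */
        printf("%-12s inode=%llu links=%lu\n", path,
               (unsigned long long)st.st_ino, (unsigned long)st.st_nlink);
}

int main(void) {
    link("data.txt", "hard.txt");      /* new directory entry, same inode */
    symlink("data.txt", "soft.txt");   /* new inode whose data is the target path */
    show("data.txt");
    show("hard.txt");                  /* same inode, link count now 2 */
    show("soft.txt");                  /* different inode, link count 1 */
    return 0;
}
```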
3. Disk Space Allocation
3.1 Contiguous Allocation
Each file occupies a contiguous range of disk blocks.
| Advantage | Disadvantage |
|---|---|
| Both sequential and random access are fast | Severe external fragmentation |
| Simple implementation (start block + length) | File growth is difficult |
Suitable for read-only media (e.g., ISO 9660 for CD-ROM).
3.2 Linked Allocation
Each disk block contains a pointer to the next block.
| Advantage | Disadvantage |
|---|---|
| No external fragmentation | Random access is extremely slow (\(O(n)\) seeks) |
| Easy file growth | Pointers consume space |
FAT (File Allocation Table): Stores the linked-list pointers centrally in a table that can be cached in memory, so following a file's chain no longer requires reading every data block -- improving random access.
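A toy illustration of FAT-style chaining (the table contents are made up): fat[i] holds the block that follows block i, so walking a file touches only the table.

```c
#include <stdio.h>

#define FAT_EOF -1

int main(void) {
    /* fat[i] = next block after block i, or FAT_EOF at the end of the file */
    int fat[16] = {0};
    fat[3] = 7; fat[7] = 2; fat[2] = FAT_EOF;   /* file occupies blocks 3 -> 7 -> 2 */

    int start = 3;                               /* first block, from the directory entry */
    printf("file blocks:");
    for (int b = start; b != FAT_EOF; b = fat[b])
        printf(" %d", b);
    printf("\n");
    return 0;
}
```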
3.3 Indexed Allocation
Each file has an index block storing the addresses of all data blocks.
Unix inode multi-level indexing:
inode:
├── Direct pointers ×12 → Point directly to data blocks
├── Single indirect ×1 → Index block → data blocks
├── Double indirect ×1 → Index block → index block → data blocks
└── Triple indirect ×1 → Index block → index block → index block → data blocks
Assuming 4 KB block size, 4-byte pointers (1024 pointers per block):
| Level | Addressable Blocks | Addressable Size |
|---|---|---|
| Direct (12) | 12 | 48 KB |
| Single indirect | 1024 | 4 MB |
| Double indirect | \(1024^2\) | 4 GB |
| Triple indirect | \(1024^3\) | 4 TB |
Advantage: Small files are accessed quickly (direct pointers); large files are also supported.
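A small C check of the numbers in the table above, under the same assumptions (4 KB blocks, 4-byte pointers, hence 1024 pointers per index block):

```c
#include <stdio.h>

int main(void) {
    const unsigned long long block = 4096;          /* block size in bytes */
    const unsigned long long ptrs  = block / 4;     /* pointers per index block = 1024 */

    unsigned long long direct = 12;
    unsigned long long single = ptrs;               /* 1024 blocks */
    unsigned long long dbl    = ptrs * ptrs;        /* 1024^2 blocks */
    unsigned long long triple = ptrs * ptrs * ptrs; /* 1024^3 blocks */

    printf("direct : %llu blocks = %llu KB\n", direct, direct * block / 1024);
    printf("single : %llu blocks = %llu MB\n", single, single * block >> 20);
    printf("double : %llu blocks = %llu GB\n", dbl,    dbl * block >> 30);
    printf("triple : %llu blocks = %llu TB\n", triple, triple * block >> 40);
    return 0;
}
```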
3.4 Extent-Based Allocation
Modern file systems (ext4, XFS, Btrfs) use extents -- descriptors for contiguous block ranges (start block, length):
- A single extent can describe a large contiguous region
- Reduced metadata overhead
- Improved sequential I/O performance
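A minimal sketch of an extent descriptor (field names are illustrative, not the on-disk ext4 layout): one entry maps a whole contiguous run of file blocks to disk blocks, so a large sequential file needs only a handful of entries.

```c
#include <stdio.h>

struct extent {
    unsigned long long logical_start;   /* first logical block within the file */
    unsigned long long physical_start;  /* first block on disk */
    unsigned int       length;          /* number of contiguous blocks */
};

int main(void) {
    /* a 1 GB file of 4 KB blocks described by a single extent */
    struct extent e = { 0, 123456, 262144 };
    printf("file blocks [%llu..%llu) map to disk blocks [%llu..%llu)\n",
           e.logical_start, e.logical_start + e.length,
           e.physical_start, e.physical_start + e.length);
    return 0;
}
```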
4. Free Space Management
| Method | Data Structure | Characteristics |
|---|---|---|
| Bitmap | One bit per block (0=free, 1=used) | Compact; search efficiency depends on algorithm |
| Free list | Free blocks linked in a list | Allocation is fast, but traversal is slow |
| Free block groups | Records contiguous free block ranges | Suitable for extent-based file systems |
| B+ tree | Keyed by free intervals | Used by XFS, Btrfs; lookup \(O(\log n)\) |
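A first-fit bitmap allocator in miniature, as a sketch of the first row of the table (one bit per block, 0 = free, 1 = used):

```c
#include <stdio.h>

#define NBLOCKS 64

static unsigned char bitmap[NBLOCKS / 8];           /* 0 = free, 1 = used */

static int alloc_block(void) {
    for (int b = 0; b < NBLOCKS; b++)
        if (!(bitmap[b / 8] & (1 << (b % 8)))) {     /* first free bit found */
            bitmap[b / 8] |= (1 << (b % 8));
            return b;
        }
    return -1;                                       /* no free space */
}

static void free_block(int b) {
    bitmap[b / 8] &= ~(1 << (b % 8));                /* clear the bit */
}

int main(void) {
    int a = alloc_block(), b = alloc_block();
    printf("allocated blocks %d and %d\n", a, b);
    free_block(a);
    printf("after freeing %d, next allocation returns %d\n", a, alloc_block());
    return 0;
}
```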
5. Modern File Systems
| File System | Features | Use Case |
|---|---|---|
| ext4 | Journaling, extents, delayed allocation | Linux default |
| XFS | B+ tree indexing, high concurrency, large file optimization | Big data, servers |
| Btrfs | COW, snapshots, checksums, RAID | Advanced Linux features |
| ZFS | Storage pools, end-to-end checksums, snapshots, dedup | FreeBSD, storage servers |
| NTFS | MFT, ACLs, encryption | Windows |
| APFS | COW, snapshots, encryption, space sharing | macOS/iOS |
Journaling File Systems
Prevent crash-induced file system inconsistency:
| Mode | Journal Contents | Performance | Safety |
|---|---|---|---|
| Journal (full) | Metadata + data | Slowest | Highest |
| Ordered | Metadata only (data written before journal) | Medium | High |
| Writeback | Metadata only (data and journal order undefined) | Fastest | Medium |
ext4 defaults to ordered mode.
COW (Copy-on-Write) File Systems
Core design of Btrfs and ZFS:
- Never modify data in place -- always write to a new location
- Natural support for snapshots (just preserve old pointers)
- Natural data consistency (writes are atomic)
- Cost: May cause fragmentation
6. RAID
RAID (Redundant Array of Independent Disks): Combines multiple disks to achieve high performance and/or high reliability.
| RAID Level | Structure | Capacity Utilization | Fault Tolerance | Read Perf. | Write Perf. |
|---|---|---|---|---|---|
| RAID 0 | Striping | 100% | None | \(N\times\) | \(N\times\) |
| RAID 1 | Mirroring | 50% | 1 disk failure | \(N\times\) | \(1\times\) |
| RAID 5 | Distributed parity | \((N-1)/N\) | 1 disk failure | \((N-1)\times\) | Write amplification (read-modify-write parity) |
| RAID 6 | Dual parity (P+Q) | \((N-2)/N\) | 2 disk failures | \((N-2)\times\) | More write amplification |
| RAID 10 | Mirror then stripe | 50% | 1 per mirror pair | \(N\times\) | \(N/2\times\) |
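RAID 5's fault tolerance rests on XOR parity: P = D0 ⊕ D1 ⊕ ... ⊕ D(n-1), so a lost data block is recovered by XORing the parity with the surviving blocks. A toy stripe (data values made up):

```c
#include <stdio.h>

#define STRIPE 4   /* bytes per disk in this toy stripe */

int main(void) {
    unsigned char d0[STRIPE] = {1, 2, 3, 4};
    unsigned char d1[STRIPE] = {5, 6, 7, 8};
    unsigned char d2[STRIPE] = {9, 10, 11, 12};
    unsigned char p[STRIPE], rebuilt[STRIPE];

    for (int i = 0; i < STRIPE; i++)
        p[i] = d0[i] ^ d1[i] ^ d2[i];          /* compute parity */

    for (int i = 0; i < STRIPE; i++)           /* "disk 1 failed": rebuild it */
        rebuilt[i] = p[i] ^ d0[i] ^ d2[i];

    for (int i = 0; i < STRIPE; i++)
        printf("%u ", rebuilt[i]);             /* prints 5 6 7 8 */
    printf("\n");
    return 0;
}
```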
RAID Is Not Backup
RAID provides hardware fault tolerance but does not protect against human error, software bugs, or ransomware. Independent backups are always necessary.
7. I/O System
7.1 I/O Methods
| Method | Mechanism | CPU Involvement | Use Case |
|---|---|---|---|
| Programmed I/O (Polling) | CPU polls device status register | Continuously occupied | Low latency, simple devices |
| Interrupt-driven I/O | Device sends interrupt upon completion | Only at start and finish | General purpose |
| DMA (Direct Memory Access) | DMA controller transfers directly between device and memory | Only at start and finish | Bulk data transfer |
DMA flow:
- CPU configures DMA controller (source address, destination address, transfer size)
- DMA controller performs data transfer (CPU can do other work)
- Transfer complete -> DMA controller sends interrupt to CPU
7.2 I/O Schedulers
Disk Scheduling (HDD)
Disk seek time is the primary latency source. Scheduling algorithms optimize head movement:
| Algorithm | Strategy | Characteristics |
|---|---|---|
| FCFS | Process in arrival order | Fair but large seek distances |
| SSTF | Select nearest request (Shortest Seek Time First) | May cause starvation |
| SCAN (Elevator) | Head sweeps one direction to the end, then reverses | Moderate wait times |
| C-SCAN (Circular Scan) | Sweep one direction to the end, jump back to start | Fairer |
| LOOK / C-LOOK | Only go to the farthest request, not disk boundary | Improved version of SCAN |
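A sketch of the LOOK flavor of SCAN (it reverses at the farthest pending request rather than the disk boundary); the request queue and initial head position below are made up for illustration:

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    int req[] = {98, 183, 37, 122, 14, 124, 65, 67};   /* pending cylinder requests */
    int n = sizeof(req) / sizeof(req[0]);
    int head = 53;                                     /* current head position */

    qsort(req, n, sizeof(int), cmp);

    /* index of the first request at or above the head position */
    int split = 0;
    while (split < n && req[split] < head) split++;

    printf("service order (sweeping toward higher cylinders first):");
    for (int i = split; i < n; i++) printf(" %d", req[i]);      /* upward sweep */
    for (int i = split - 1; i >= 0; i--) printf(" %d", req[i]); /* then reverse */
    printf("\n");
    return 0;
}
```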
Modern Linux I/O Schedulers
- mq-deadline: Assigns expiration deadlines to requests to bound latency and prevent starvation (common default for HDDs)
- BFQ (Budget Fair Queueing): Optimized for desktop interactive responsiveness
- none (noop): Suitable for SSD (no seek optimization needed)
SSD Characteristics
| Feature | Difference from HDD |
|---|---|
| No seek latency | Random read/write performance far superior to HDD |
| Write amplification | Must erase before writing (erase block > write page) |
| Wear leveling | FTL controller distributes writes to extend lifespan |
| TRIM | Notifies SSD which blocks are no longer in use, optimizing GC |
| Parallel channels | Multi-channel/multi-die parallelism boosts bandwidth |
The ideal write amplification factor (WAF) is 1; in practice it is typically 2-10, depending on workload and GC efficiency.
7.3 I/O Software Stack
User space: Application → Standard library (stdio)
─────── System call boundary ───────
Kernel: VFS (Virtual File System layer)
→ Specific file system (ext4/XFS/...)
→ Page Cache
→ Block I/O layer (I/O scheduling)
→ Device driver
→ Hardware controller
Hardware: Disk / SSD / NVMe
7.4 Page Cache
Linux uses free memory as a page cache:
- Read cache: Data read from disk stays in memory; subsequent reads hit the cache
- Write cache: Writes go to Page Cache first (dirty pages), flushed to disk in the background
- Read-ahead: Detects sequential read patterns and prefetches subsequent data
Read path: App → Page Cache hit? → Yes: Return directly
→ No: Read from disk into Page Cache → Return
Write path: App → Write to Page Cache (mark dirty)
→ background writeback (flusher) kernel threads periodically flush dirty pages to disk
→ or fsync() forces flush
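A minimal sketch of this write path, assuming a hypothetical file log.txt: write() only dirties the page cache, and fsync() blocks until the data (and associated metadata) is durable on disk.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("log.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    write(fd, "record\n", 7);   /* lands in the page cache, page marked dirty */

    if (fsync(fd) != 0)         /* force the dirty pages to reach the disk */
        perror("fsync");

    close(fd);
    return 0;
}
```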
7.5 Zero-Copy
Traditional file sending: Disk -> Kernel buffer -> User buffer -> Socket buffer -> NIC (4 copies).
sendfile() system call: Disk -> Kernel buffer -> NIC, bypassing user space (2 copies when the NIC supports scatter-gather DMA, 3 otherwise).
mmap + write: Another approach to reduce copies.
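A sketch of sendfile() on Linux, assuming hypothetical filenames: data moves from the page cache of in_fd to out_fd without passing through a user-space buffer. The classic use is sending a file to a connected socket; to keep the sketch self-contained, out_fd here is a regular file, which recent Linux kernels also allow.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int in_fd  = open("input.dat", O_RDONLY);
    int out_fd = open("copy.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in_fd < 0 || out_fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(in_fd, &st);                 /* total number of bytes to transfer */

    off_t offset = 0;
    while (offset < st.st_size) {      /* kernel copies directly, no user buffer */
        ssize_t sent = sendfile(out_fd, in_fd, &offset, st.st_size - offset);
        if (sent <= 0) { perror("sendfile"); return 1; }
    }

    close(in_fd);
    close(out_fd);
    return 0;
}
```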
8. Storage Interface Evolution
| Interface | Protocol | Bandwidth | Latency | Queue Depth |
|---|---|---|---|---|
| SATA | AHCI | ~600 MB/s | ~100 us | 32 |
| SAS | SCSI | ~2.4 GB/s | ~100 us | 256 |
| NVMe (PCIe 4.0) | NVMe | ~7 GB/s | ~10 us | 65535 |
| NVMe (PCIe 5.0) | NVMe | ~14 GB/s | ~10 us | 65535 |
NVMe advantages: Designed specifically for flash; multi-queue (independent queue per CPU core), low latency, high parallelism.
Navigation
- Previous: Memory Management
- Related: Memory Hierarchy Design