Distributed Systems
Introduction
A distributed system consists of multiple independent computing nodes collaborating over a network. This article covers core theory (CAP theorem, consistency models, consensus algorithms) and engineering practices (microservices, message queues, distributed transactions).
1. CAP Theorem
The CAP theorem states that a distributed system cannot simultaneously guarantee all three of the following properties; when a network partition occurs, it must choose between consistency and availability:
| Property | Meaning |
|---|---|
| C (Consistency) | All nodes see the same data at the same time |
| A (Availability) | Every request receives a (non-error) response |
| P (Partition Tolerance) | The system continues to operate despite network partitions |
Since network partitions are unavoidable in practice, the real choice is between CP and AP:
| Choice | Meaning | Representative Systems |
|---|---|---|
| CP | Guarantees consistency, sacrifices availability | ZooKeeper, etcd, HBase |
| AP | Guarantees availability, sacrifices consistency | Cassandra, DynamoDB, CouchDB |
PACELC Extension
The PACELC theorem refines CAP: when a Partition occurs, trade Availability against Consistency; Else, even in normal operation, a system still trades Latency against Consistency.
2. Consistency Models
2.1 Strong Consistency (Linearizability)
All operations appear to execute in a global total order, and reads always return the most recent write value.
- Implementation: single leader + synchronous replication (sketched after this list)
- Cost: high latency
- Representative: Spanner (TrueTime)
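A toy sketch of the single-leader, synchronous-replication approach (the classes and in-memory "replication" are illustrative, not a real protocol): the leader acknowledges a write only after every replica has applied it, which is why any subsequent read sees the latest value, and why write latency is bounded by the slowest replica.

class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Leader(Replica):
    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas

    def write(self, key, value):
        # Synchronous replication: do not acknowledge until every
        # replica has applied the write. Reads at any node then return
        # the latest value; the cost is waiting on the slowest replica.
        self.apply(key, value)
        for replica in self.replicas:
            replica.apply(key, value)
        return "ok"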
2.2 Eventual Consistency
If no new writes occur, all replicas will eventually converge to the same value.
- Stale reads possible during the convergence window
- Representatives: DNS, Amazon S3, Cassandra
2.3 Causal Consistency
Guarantees that causally related operations are observed in causal order by all nodes, while concurrent operations may be arbitrarily ordered.
Process A: write(x=1) → write(y=2)
Process B: if it sees y=2, it must have already seen x=1
- Implementation: vector clocks (sketched after this list)
- Weaker than strong consistency, stronger than eventual consistency
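A minimal vector-clock sketch (the class shape is illustrative; the tick/merge rules are the standard algorithm): each node increments its own slot on a local event and takes element-wise maxima on receive, so causally related events compare as ordered while concurrent ones do not.

class VectorClock:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.clock = [0] * num_nodes

    def tick(self):
        # Local event (e.g., a write): advance this node's component.
        self.clock[self.node_id] += 1
        return list(self.clock)

    def merge(self, received):
        # On message receipt: element-wise max with the sender's clock.
        self.clock = [max(a, b) for a, b in zip(self.clock, received)]
        self.tick()

    def happened_before(self, other):
        # self -> other iff self <= other component-wise and they differ.
        return (all(a <= b for a, b in zip(self.clock, other))
                and self.clock != list(other))

# Process A: write(x=1) then write(y=2) -> A's clock orders them causally.
# Process B: merging A's clock before reading y=2 guarantees it sees x=1.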
3. Consensus Algorithms
The consensus problem: getting multiple nodes to agree on a single value, despite node crashes and message loss.
3.1 Paxos
Paxos is the classic consensus algorithm. It involves three roles:
| Role | Responsibility |
|---|---|
| Proposer | Proposes a value |
| Acceptor | Accepts/rejects proposals |
| Learner | Learns the chosen value |
Two-phase process:
Phase 1 (Prepare):
Proposer → Acceptor: Prepare(n)
Acceptor → Proposer: Promise(n, accepted_value)
(Acceptor promises not to accept proposals with numbers less than n)
Phase 2 (Accept):
Proposer → Acceptor: Accept(n, value)
Acceptor → Proposer: Accepted(n, value)
(Majority of Acceptors accept → value is chosen)
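A sketch of the acceptor's decision rules for both phases (message transport is omitted, and a real acceptor must persist its promised/accepted state to disk before replying):

class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest proposal number promised
        self.accepted_n = -1       # number of the proposal last accepted
        self.accepted_value = None

    def on_prepare(self, n):
        # Phase 1: promise to ignore anything numbered below n, and
        # report any value already accepted so the proposer adopts it.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n, None)

    def on_accept(self, n, value):
        # Phase 2: accept unless a higher-numbered Prepare arrived meanwhile.
        if n >= self.promised_n:
            self.promised_n = self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n, value)
        return ("reject", self.promised_n, None)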
3.2 Raft
Raft is a consensus algorithm designed to be easier to understand and implement than Paxos, while offering equivalent fault tolerance.
Leader Election:
                 times out, starts election
    Follower ──────────────────────────────→ Candidate
        ↑                                        │
        │  discovers current leader              │  wins majority
        │  or a higher term                      │  of votes
        │                                        ▼
        └─────────────────────────────────── Leader
          (a Leader that discovers a higher term steps down to Follower)
Election rules:
- Follower times out without receiving a heartbeat → becomes Candidate (see the sketch after this list)
- Candidate increments its term number, requests votes from all nodes
- Wins majority of votes → becomes Leader
- At most one Leader per term
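A sketch of the randomized timeout and vote counting behind these rules (the node/peer objects and the request_vote call are hypothetical; the 150–300 ms range is the Raft paper's suggestion):

import random

def next_election_timeout():
    # Randomized per node so followers rarely time out together,
    # which makes split votes and repeated elections unlikely.
    return random.uniform(0.150, 0.300)   # seconds

def on_election_timeout(node):
    node.state = "candidate"
    node.current_term += 1          # start a new term
    node.voted_for = node.id        # vote for itself
    votes = 1 + sum(
        peer.request_vote(node.current_term, node.id,
                          node.last_log_index, node.last_log_term)
        for peer in node.peers)
    cluster_size = len(node.peers) + 1
    if votes > cluster_size // 2:   # majority of votes -> leader
        node.state = "leader"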
Log Replication:
Client → Leader: write request
Leader: append to local log
Leader → Followers: AppendEntries RPC
Followers: append to local log, reply with acknowledgment
Leader: receives majority acknowledgment → commit
Leader → Client: return success
Leader → Followers: notify of commit
Consistency guarantees:
- If two logs contain an entry with the same index and term, the logs are identical in all entries up through that index (Log Matching; enforced by the check sketched below)
- A Leader never overwrites or deletes entries in its own log; it only appends
- A Candidate can win an election only if its log is at least as up-to-date as the logs of a majority of voters
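A sketch of the follower-side consistency check in AppendEntries, which is what enforces Log Matching (the log is simplified to a list of (term, command) tuples):

def on_append_entries(log, prev_index, prev_term, entries):
    # Reject unless our log has an entry at prev_index with prev_term;
    # by induction, everything before it then matches the leader's log.
    if prev_index >= 0 and (prev_index >= len(log)
                            or log[prev_index][0] != prev_term):
        return False
    i = prev_index + 1
    for entry in entries:
        if i < len(log) and log[i][0] != entry[0]:
            del log[i:]            # drop the conflicting suffix
        if i >= len(log):
            log.append(entry)
        i += 1
    return True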
4. Microservice Architecture
4.1 Service Discovery
        ┌──────────────────┐
        │ Service Registry │  ← service registration center
        │  (Consul/etcd)   │
        └────────┬─────────┘
                 │ register / heartbeat / query
           ┌─────┼─────┐
           ▼     ▼     ▼
          [A]   [B]   [C]    ← microservice instances
| Approach | Implementation | Representative |
|---|---|---|
| Client-side discovery | Client queries registry directly | Netflix Eureka |
| Server-side discovery | Load balancer queries registry | AWS ALB, Consul |
| DNS discovery | Resolves service addresses via DNS | Kubernetes Service |
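A toy in-memory registry showing the register/heartbeat/query cycle from the diagram above (in production this role is played by Consul, etcd, or Eureka; the TTL value and class shape are illustrative):

import time

class Registry:
    TTL = 10   # seconds without a heartbeat before an instance is dropped

    def __init__(self):
        self.instances = {}   # service name -> {address: last heartbeat}

    def register(self, service, address):
        self.instances.setdefault(service, {})[address] = time.time()

    def heartbeat(self, service, address):
        self.register(service, address)   # a heartbeat refreshes the lease

    def lookup(self, service):
        # Return only instances whose lease is still fresh.
        now = time.time()
        live = {addr: ts
                for addr, ts in self.instances.get(service, {}).items()
                if now - ts < self.TTL}
        self.instances[service] = live
        return list(live)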
4.2 API Gateway
The API gateway is the unified entry point for microservices:
Client → [API Gateway] ─┬─→ Service A
                        ├─→ Service B
                        └─→ Service C
Functions:
- Routing: routes to different services based on URL/Header
- Authentication: unified JWT/OAuth verification
- Rate limiting: throttles excess traffic to protect backend services
- Circuit breaking: stops forwarding requests to instances that keep failing
- Logging: request tracing
Representative tools: Kong, Nginx, AWS API Gateway, Envoy.
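A minimal sketch of the routing step (the prefix table and service hostnames are made up; a real gateway layers authentication, rate limiting, and circuit breaking around this lookup):

ROUTES = {
    "/orders":   "http://order-service:8080",
    "/users":    "http://user-service:8080",
    "/payments": "http://payment-service:8080",
}

def route(path):
    # Longest-prefix match, so "/orders/42" reaches the order service.
    for prefix in sorted(ROUTES, key=len, reverse=True):
        if path.startswith(prefix):
            return ROUTES[prefix] + path
    raise LookupError("no route for " + path)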
4.3 Circuit Breaker Pattern
Prevents cascading failures:
State machine:
            failure rate exceeds threshold
    Closed ───────────────────────────────→ Open
       ↑                                      │
       │  probe succeeds                      │  timeout elapses
       │                                      ▼
       └────────────────────────────────  Half-Open
               (probe fails → back to Open)
import time

class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected while the breaker is open."""

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = 0, 1, 2

    def __init__(self, failure_threshold=5, timeout=30):
        self.state = self.CLOSED
        self.failures = 0
        self.threshold = failure_threshold
        self.timeout = timeout   # seconds to stay Open before allowing a probe
        self.last_failure_time = None

    def call(self, func, *args):
        if self.state == self.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = self.HALF_OPEN   # let one probe request through
            else:
                raise CircuitBreakerOpen()
        try:
            result = func(*args)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        # Any success (including a Half-Open probe) closes the breaker.
        self.state = self.CLOSED
        self.failures = 0

    def _on_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        # A failed probe, or crossing the threshold, (re)opens the breaker.
        if self.state == self.HALF_OPEN or self.failures >= self.threshold:
            self.state = self.OPEN
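A usage sketch (flaky_call and the thresholds are illustrative): the caller treats CircuitBreakerOpen as an immediate, cheap failure instead of waiting on a dying dependency.

breaker = CircuitBreaker(failure_threshold=3, timeout=10)

def flaky_call():
    ...   # e.g., an HTTP request to a downstream service

try:
    breaker.call(flaky_call)
except CircuitBreakerOpen:
    ...   # fail fast: serve a fallback / degraded response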
5. Message Queues
5.1 Kafka
Apache Kafka is a high-throughput, durable distributed message streaming platform.
Core concepts:
| Concept | Description |
|---|---|
| Topic | Message category |
| Partition | Partition of a Topic (unit of parallelism) |
| Producer | Message producer |
| Consumer | Message consumer |
| Consumer Group | A group of consumers; each partition is consumed by at most one consumer in the group |
| Offset | Position of a message within a partition |
| Broker | Kafka server node |
Producer → Topic [P0 | P1 | P2] → Consumer Group
                                    ├── Consumer 1 ← P0
                                    ├── Consumer 2 ← P1
                                    └── Consumer 3 ← P2
Characteristics:
- Messages persisted to disk (sequential I/O, high throughput)
- Supports message replay (re-consuming historical messages)
- Partitions enable horizontal scaling
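A sketch using the third-party kafka-python client (the broker address, topic, and group name are assumptions, and this client library is one choice among several):

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Messages with the same key always land in the same partition,
# which preserves per-key ordering.
producer.send("events", key=b"order-42", value=b'{"status": "created"}')
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="billing",             # partitions are divided within the group
    auto_offset_reset="earliest",   # replay from the start if no stored offset
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.value)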
5.2 RabbitMQ
A traditional message queue based on the AMQP protocol.
Producer → Exchange ──(routing rules)──→ Queue → Consumer
Exchange types:
- Direct: exact routing key match
- Fanout: broadcast to all bound queues (sketched after this list)
- Topic: wildcard routing key match
- Headers: match based on message headers
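A fanout sketch using the pika client (broker address, exchange name, and message body are assumptions):

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# Fanout: every queue bound to the exchange receives a copy of each message.
ch.exchange_declare(exchange="logs", exchange_type="fanout")
result = ch.queue_declare(queue="", exclusive=True)   # broker-named queue
ch.queue_bind(exchange="logs", queue=result.method.queue)

ch.basic_publish(exchange="logs", routing_key="", body=b"hello")
conn.close()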
5.3 Kafka vs RabbitMQ
| Dimension | Kafka | RabbitMQ |
|---|---|---|
| Model | Publish-subscribe (log) | Message queue |
| Throughput | Millions/sec | Tens of thousands/sec |
| Message retention | Persistent, replayable | Deleted after consumption |
| Order guarantee | Ordered within partition | Ordered within queue |
| Use cases | Event streaming, logs, big data | Task queues, RPC |
6. Distributed Transactions
6.1 Two-Phase Commit (2PC)
Coordinator                Participant A      Participant B
     │──── Prepare ────────────→│                  │
     │──── Prepare ─────────────┼─────────────────→│
     │←─── OK ──────────────────│                  │
     │←─── OK ──────────────────┼──────────────────│
     │                          │                  │
     │──── Commit ─────────────→│                  │
     │──── Commit ──────────────┼─────────────────→│
     │←─── ACK ─────────────────│                  │
     │←─── ACK ─────────────────┼──────────────────│
Problems:
- Blocking: participants wait for the coordinator's decision
- Single point of failure: if the coordinator crashes, participants block
- Data inconsistency: network partitions can cause inconsistency
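A sketch of the coordinator's logic (participant objects with prepare/commit/rollback methods are hypothetical; a real coordinator must also log its decision durably, which is exactly what the single-point-of-failure problem above is about):

def two_phase_commit(participants):
    # Phase 1 (voting): every participant must vote yes.
    if all(p.prepare() for p in participants):
        # Phase 2 (commit): the decision is now final.
        for p in participants:
            p.commit()
        return True
    # Any "no" vote (or timeout): abort everywhere.
    # (A real coordinator only rolls back participants that prepared.)
    for p in participants:
        p.rollback()
    return False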
6.2 Saga Pattern
Splits a large transaction into a series of local transactions, each with a corresponding compensating action.
T1 → T2 → T3 → ... → Tn
          │ failure at Tk
          ▼
C(k-1) → ... → C2 → C1    (compensations run in reverse order)
Example: order flow
1. Create order (T1) Compensate: cancel order (C1)
2. Deduct inventory (T2) Compensate: restore inventory (C2)
3. Charge payment (T3) Compensate: refund (C3)
4. Send notification (T4) Compensate: send cancellation notice (C4)
If T3 fails: execute C2 → C1
Two orchestration approaches:
| Approach | Implementation | Advantages | Disadvantages |
|---|---|---|---|
| Choreography | Event-driven, services communicate directly | Decoupled | Hard to trace |
| Orchestration | Central orchestrator controls the flow | Clear | Orchestrator complexity |
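A sketch of the orchestration variant (step functions are hypothetical; each step is paired with its compensation, mirroring the order-flow example above):

def run_saga(steps):
    """steps: list of (action, compensation) pairs of callables."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            # Undo the completed steps in reverse order: ... -> C2 -> C1.
            for comp in reversed(completed):
                comp()
            raise

# e.g. run_saga([(create_order, cancel_order),
#                (deduct_inventory, restore_inventory),
#                (charge_payment, refund)])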
7. Distributed System Design Principles
| Principle | Meaning |
|---|---|
| Idempotency | Executing the same operation multiple times has the same effect as once |
| Timeouts and retries | Set reasonable timeouts with exponential backoff retries |
| Graceful degradation | Core functionality continues when some features are unavailable |
| Observability | Logging, Metrics, Tracing |
| Eventual consistency | Accept short-term inconsistency, guarantee eventual convergence |
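A sketch of the timeout-and-retry principle with capped exponential backoff and full jitter (the base, cap, and attempt count are illustrative; the operation being retried should be idempotent, per the first row of the table):

import random
import time

def retry(func, attempts=5, base=0.1, cap=5.0):
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise                       # retry budget exhausted
            # Full jitter: random sleep up to the capped exponential delay,
            # which spreads retries from many clients apart in time.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))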
Relations to Other Topics
- See System Design for how service decomposition, caching, messaging, and capacity planning connect to distributed scenarios
- See Database Systems for the connection between replication, transactions, consistency, and storage design
- See Cloud Services for the infrastructure layer that supports elasticity and resilience
- See Computer Networks for how latency, protocol behavior, and transport reliability shape distributed systems
References
- "Designing Data-Intensive Applications" - Martin Kleppmann
- "Distributed Systems" - Maarten van Steen & Andrew Tanenbaum
- Raft paper: "In Search of an Understandable Consensus Algorithm"
- Kafka Official Documentation