
Distributed Systems

Introduction

A distributed system consists of multiple independent computing nodes collaborating over a network. This article covers core theory (CAP theorem, consistency models, consensus algorithms) and engineering practices (microservices, message queues, distributed transactions).


1. CAP Theorem

The CAP theorem states that a distributed system cannot simultaneously provide all three of the following guarantees; when a network partition occurs, it must give up either consistency or availability:

Property                  Meaning
C (Consistency)           All nodes see the same data
A (Availability)          Every request receives a (non-error) response
P (Partition Tolerance)   The system continues to operate despite network partitions

Since network partitions are unavoidable in practice, the real choice is between CP and AP:

Choice   Meaning                                           Representative Systems
CP       Guarantees consistency, sacrifices availability   ZooKeeper, etcd, HBase
AP       Guarantees availability, sacrifices consistency   Cassandra, DynamoDB, CouchDB

PACELC Extension

The PACELC theorem extends CAP: when a Partition occurs, choose between Availability and Consistency; Else (during normal, partition-free operation), there is still a trade-off between Latency and Consistency.


2. Consistency Models

2.1 Strong Consistency (Linearizability)

All operations appear to execute in a global total order, and reads always return the most recent write value.

\[ \text{write}(x, v) \prec \text{read}(x) \Rightarrow \text{read}(x) = v \]
  • Implementation: single leader + synchronous replication
  • Cost: high latency
  • Representative: Spanner (TrueTime)
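A minimal sketch of the single-leader, synchronous-replication approach mentioned above: the leader acknowledges a write only after every replica has applied it, so a subsequent read from any replica returns the latest value. The class and method names are illustrative, not taken from any particular system.

class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True  # acknowledge the write

class Leader:
    """Single leader that replicates synchronously to all followers."""
    def __init__(self, followers):
        self.local = Replica()
        self.followers = followers

    def write(self, key, value):
        # Apply locally, then wait for every follower to acknowledge before
        # confirming to the client: reads are linearizable, but write latency
        # is bounded by the slowest replica.
        self.local.apply(key, value)
        acks = [f.apply(key, value) for f in self.followers]
        if not all(acks):
            raise RuntimeError("replication failed")
        return "OK"

    def read(self, key):
        return self.local.data.get(key)

# Usage
followers = [Replica(), Replica()]
leader = Leader(followers)
leader.write("x", 1)
assert all(f.data["x"] == 1 for f in followers)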

2.2 Eventual Consistency

If no new writes occur, all replicas will eventually converge to the same value.

\[ \forall i, j:\ \lim_{t \to \infty} \text{replica}_i(x, t) = \lim_{t \to \infty} \text{replica}_j(x, t) \]
  • Stale reads possible during the convergence window
  • Representatives: DNS, Amazon S3, Cassandra

2.3 Causal Consistency

Guarantees that causally related operations are observed in causal order by all nodes, while concurrent operations may be arbitrarily ordered.

Process A: write(x=1) → write(y=2)
Process B: if it sees y=2, it must have already seen x=1
  • Implementation: vector clocks
  • Weaker than strong consistency, stronger than eventual consistency
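A minimal vector-clock sketch, since the list above names vector clocks as the usual implementation mechanism: each process keeps one counter per process, increments its own entry on a local event, and merges clocks on receive; comparing clocks shows whether two events are causally ordered or concurrent. This is a generic illustration, not tied to any specific datastore.

class VectorClock:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.clock = [0] * num_nodes

    def local_event(self):
        # A local write/event advances this node's own counter.
        self.clock[self.node_id] += 1

    def merge(self, other):
        # On receiving a message, take the element-wise max, then tick.
        self.clock = [max(a, b) for a, b in zip(self.clock, other)]
        self.clock[self.node_id] += 1

    def happened_before(self, other):
        # self -> other iff self <= other element-wise and self != other.
        return all(a <= b for a, b in zip(self.clock, other)) and self.clock != list(other)

# Process A writes x=1 then y=2; process B sees y=2 only after merging A's clock,
# so it has necessarily observed the state that already covers x=1.
a = VectorClock(0, 2)
b = VectorClock(1, 2)
a.local_event()            # write(x=1)
a.local_event()            # write(y=2)
b.merge(a.clock)           # B receives A's updates
print(a.clock, b.clock)    # [2, 0] and [2, 1]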

3. Consensus Algorithms

The consensus problem: getting multiple nodes to agree on a single value, even when some nodes fail or messages are delayed.

3.1 Paxos

Paxos is the classic consensus algorithm; it involves three roles:

Role       Responsibility
Proposer   Proposes a value
Acceptor   Accepts or rejects proposals
Learner    Learns the chosen value

Two-phase process:

Phase 1 (Prepare):
  Proposer → Acceptor: Prepare(n)
  Acceptor → Proposer: Promise(n, accepted_value)
  (Acceptor promises not to accept proposals with numbers less than n)

Phase 2 (Accept):
  Proposer → Acceptor: Accept(n, value)
  Acceptor → Proposer: Accepted(n, value)
  (Majority of Acceptors accept → value is chosen)
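A condensed sketch of an Acceptor's state for the two phases above (single-decree Paxos, with local calls standing in for network messages; the names are illustrative, and the proposer logic is simplified):

class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised
        self.accepted_n = -1        # highest proposal number accepted
        self.accepted_value = None

    def prepare(self, n):
        # Phase 1: promise not to accept proposals numbered < n,
        # and report any value already accepted.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", None, None)

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered promise has been made.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n, value)
        return ("reject", None, None)

# A proposer with proposal number 1 gathers a majority of promises, then a
# majority of accepts, so the value is chosen. (A full proposer must adopt the
# value with the highest accepted_n reported in the promises, if any.)
acceptors = [Acceptor() for _ in range(3)]
promises = [a.prepare(1) for a in acceptors]
if sum(p[0] == "promise" for p in promises) > len(acceptors) // 2:
    acks = [a.accept(1, "v") for a in acceptors]
    print("chosen" if sum(r[0] == "accepted" for r in acks) > len(acceptors) // 2 else "not chosen")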

3.2 Raft

Raft is a consensus algorithm designed as a more understandable alternative to Paxos, making it easier to implement correctly.

Leader Election:

                     Timeout
   Follower ────────────────────────→ Candidate
      ↑  ↑                                │
      │  │  Discovers current Leader      │
      │  └────────────────────────────────┤
      │                                   │ Wins majority of votes
      │     Discovers higher term         ▼
      └──────────────────────────────── Leader

Election rules:

  1. Follower times out without receiving a heartbeat → becomes Candidate
  2. Candidate increments its term number, requests votes from all nodes
  3. Wins majority of votes → becomes Leader
  4. At most one Leader per term

Log Replication:

Client → Leader: write request
Leader: append to local log
Leader → Followers: AppendEntries RPC
Followers: append to local log, reply with acknowledgment
Leader: receives majority acknowledgment → commit
Leader → Client: return success
Leader → Followers: notify of commit
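A minimal sketch of the leader-side commit rule from the flow above: an entry is committed once the leader knows it is stored on a majority of servers. This is illustrative code, not taken from the Raft paper, and it omits the additional rule that a leader may only directly commit entries from its current term.

class RaftLeaderLog:
    def __init__(self, cluster_size):
        self.log = []                  # list of (term, command)
        self.cluster_size = cluster_size
        self.match_index = {}          # follower_id -> highest replicated index
        self.commit_index = -1

    def append(self, term, command):
        self.log.append((term, command))
        return len(self.log) - 1       # index of the new entry

    def on_append_entries_ack(self, follower_id, index):
        # A follower acknowledged entries up through `index`.
        self.match_index[follower_id] = index
        self._advance_commit_index()

    def _advance_commit_index(self):
        # Commit the highest index replicated on a majority (counting the leader itself).
        for idx in range(len(self.log) - 1, self.commit_index, -1):
            replicas = 1 + sum(1 for m in self.match_index.values() if m >= idx)
            if replicas > self.cluster_size // 2:
                self.commit_index = idx
                break

# 3-node cluster: one follower's acknowledgment plus the leader is a majority.
leader = RaftLeaderLog(cluster_size=3)
i = leader.append(term=1, command="x=1")
leader.on_append_entries_ack("follower-1", i)
print(leader.commit_index)  # 0 -> the entry is committed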

Consistency guarantees:

  • If entries in two logs have the same index and term, the logs are identical in all entries up through that index
  • A Leader only appends to its log; it never overwrites or deletes its own entries
  • A Candidate can win an election only if its log is at least as up-to-date as that of each voter in the majority that elects it, so the new Leader is guaranteed to contain all committed entries

4. Microservice Architecture

4.1 Service Discovery

┌───────────────────┐
│ Service Registry  │  ← Service registration center
│ (Consul / etcd)   │
└─────────┬─────────┘
          │ Register / Heartbeat / Query
     ┌────┼────┐
     ▼    ▼    ▼
    [A]  [B]  [C]   ← Microservice instances

Approach               Implementation                         Representative
Client-side discovery  Client queries the registry directly   Netflix Eureka
Server-side discovery  Load balancer queries the registry     AWS ALB, Consul
DNS discovery          Resolves service addresses via DNS     Kubernetes Service
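A minimal in-memory sketch of the registry interaction (register / heartbeat / query) shown above. Real registries such as Consul or etcd expose this over HTTP/gRPC with TTL-based health checks; the class and method names here are purely illustrative.

import time

class ServiceRegistry:
    def __init__(self, ttl=10):
        self.ttl = ttl                  # seconds before an instance is considered dead
        self.instances = {}             # service name -> {address: last_heartbeat}

    def register(self, service, address):
        self.instances.setdefault(service, {})[address] = time.time()

    def heartbeat(self, service, address):
        if address in self.instances.get(service, {}):
            self.instances[service][address] = time.time()

    def query(self, service):
        # Return only instances whose heartbeat is still fresh.
        now = time.time()
        return [addr for addr, beat in self.instances.get(service, {}).items()
                if now - beat < self.ttl]

registry = ServiceRegistry()
registry.register("orders", "10.0.0.5:8080")
registry.register("orders", "10.0.0.6:8080")
print(registry.query("orders"))   # client-side discovery: pick one of these addresses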

4.2 API Gateway

The API gateway is the unified entry point for microservices:

Client → [API Gateway] → Service A
                       → Service B
                       → Service C

Functions:
- Routing: routes to different services based on URL/Header
- Authentication: unified JWT/OAuth verification
- Rate limiting: protects backend services from traffic spikes
- Circuit breaking: fails fast when a downstream service is unhealthy
- Logging: request tracing

Representative tools: Kong, Nginx, AWS API Gateway, Envoy.
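A toy sketch of the routing and authentication responsibilities listed above; the route table, header name, and upstream addresses are made up for illustration.

ROUTES = {                     # path prefix -> upstream service address (illustrative)
    "/orders":  "http://orders-service:8080",
    "/users":   "http://users-service:8080",
    "/billing": "http://billing-service:8080",
}

def handle(path, headers):
    # Authentication: every request must carry a bearer token.
    token = headers.get("Authorization", "")
    if not token.startswith("Bearer "):
        return 401, "unauthorized"

    # Routing: forward to the upstream whose prefix matches the path.
    for prefix, upstream in ROUTES.items():
        if path.startswith(prefix):
            return 200, f"forwarded to {upstream}{path}"
    return 404, "no route"

print(handle("/orders/42", {"Authorization": "Bearer abc"}))   # (200, 'forwarded to ...')
print(handle("/orders/42", {}))                                # (401, 'unauthorized')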

4.3 Circuit Breaker Pattern

Prevents cascading failures:

State machine:

            Failure rate exceeds threshold
   Closed ─────────────────────────→ Open
      ↑                                │
      │ Probe succeeds                 │ Timeout elapses
      │                                ▼
      └───────────────────────── Half-Open
              Probe fails → back to Open

import time

class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected because the breaker is open."""
    pass

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = 0, 1, 2

    def __init__(self, failure_threshold=5, timeout=30):
        self.state = self.CLOSED
        self.failures = 0
        self.threshold = failure_threshold
        self.timeout = timeout              # seconds to stay open before probing
        self.last_failure_time = None

    def call(self, func, *args):
        if self.state == self.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = self.HALF_OPEN   # allow a single probe request
            else:
                raise CircuitBreakerOpen()

        try:
            result = func(*args)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        # A successful call (including the half-open probe) closes the breaker.
        self.state = self.CLOSED
        self.failures = 0

    def _on_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        if self.state == self.HALF_OPEN or self.failures >= self.threshold:
            self.state = self.OPEN
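
Example usage of the class above, wrapping a flaky downstream call (the function names are placeholders):

import random

def flaky_service():
    if random.random() < 0.5:
        raise RuntimeError("downstream error")
    return "ok"

breaker = CircuitBreaker(failure_threshold=3, timeout=5)
for _ in range(10):
    try:
        print(breaker.call(flaky_service))
    except CircuitBreakerOpen:
        print("breaker open, failing fast")
    except RuntimeError:
        print("call failed")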

5. Message Queues

5.1 Kafka

Apache Kafka is a high-throughput, durable distributed message streaming platform.

Core concepts:

Concept         Description
Topic           Message category
Partition       A partition of a Topic; the unit of parallelism
Producer        Message producer
Consumer        Message consumer
Consumer Group  A group of consumers; each partition is consumed by only one consumer in the group
Offset          Position of a message within a partition
Broker          Kafka server node

Producer → Topic [P0|P1|P2] → Consumer Group
                                ├── Consumer 1 ← P0
                                ├── Consumer 2 ← P1
                                └── Consumer 3 ← P2

Characteristics:

  • Messages persisted to disk (sequential I/O, high throughput)
  • Supports message replay (re-consuming historical messages)
  • Partitions enable horizontal scaling
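A minimal producer/consumer sketch using the kafka-python client, assuming a broker at localhost:9092 and a topic named "events" (both are assumptions for illustration):

from kafka import KafkaProducer, KafkaConsumer

# Producer: messages with the same key go to the same partition,
# which preserves per-key ordering.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", key=b"user-42", value=b'{"action": "login"}')
producer.flush()

# Consumer: part of a consumer group; each partition of "events"
# is assigned to exactly one consumer in the group.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",   # replay from the beginning if no committed offset
)
for message in consumer:
    print(message.partition, message.offset, message.value)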

5.2 RabbitMQ

A traditional message queue based on the AMQP protocol.

Producer → Exchange → Queue → Consumer
           (routing rules)

Exchange types:
- Direct:  exact routing key match
- Fanout:  broadcast to all bound queues
- Topic:   wildcard routing key match
- Headers: match based on message headers
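A minimal sketch with the pika client for the Direct exchange case above, assuming a local RabbitMQ broker; the exchange, queue, and routing-key names are illustrative.

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Direct exchange: messages are routed to queues whose binding key
# exactly matches the message's routing key.
channel.exchange_declare(exchange="tasks", exchange_type="direct")
channel.queue_declare(queue="email")
channel.queue_bind(queue="email", exchange="tasks", routing_key="email")

# Producer
channel.basic_publish(exchange="tasks", routing_key="email", body=b"send welcome mail")

# Consumer
def on_message(ch, method, properties, body):
    print("got:", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="email", on_message_callback=on_message)
channel.start_consuming()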

5.3 Kafka vs RabbitMQ

Dimension          Kafka                             RabbitMQ
Model              Publish-subscribe (log-based)     Message queue
Throughput         Millions of messages/sec          Tens of thousands of messages/sec
Message retention  Persistent, replayable            Deleted after consumption
Order guarantee    Ordered within a partition        Ordered within a queue
Use cases          Event streaming, logs, big data   Task queues, RPC

6. Distributed Transactions

6.1 Two-Phase Commit (2PC)

Coordinator              Participant A        Participant B
     │                         │                    │
     │──── Prepare ───────────→│                    │
     │──── Prepare ────────────┼───────────────────→│
     │←─── OK ─────────────────│                    │
     │←─── OK ─────────────────┼────────────────────│
     │                         │                    │
     │──── Commit ────────────→│                    │
     │──── Commit ─────────────┼───────────────────→│
     │←─── ACK ────────────────│                    │
     │←─── ACK ────────────────┼────────────────────│
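A condensed sketch of the coordinator's decision logic for the message flow above, with in-process calls standing in for network messages; the class and function names are illustrative.

class Participant:
    def prepare(self, txn):
        # Lock resources and write the undo/redo log, then vote.
        return True          # vote "yes"

    def commit(self, txn):
        print("commit", txn)

    def abort(self, txn):
        print("abort", txn)

def two_phase_commit(participants, txn):
    # Phase 1: ask every participant to prepare; any "no" vote aborts.
    votes = [p.prepare(txn) for p in participants]

    # Phase 2: broadcast the decision. If the coordinator crashes here,
    # participants are left blocked holding their locks.
    if all(votes):
        for p in participants:
            p.commit(txn)
        return "committed"
    for p in participants:
        p.abort(txn)
    return "aborted"

print(two_phase_commit([Participant(), Participant()], txn="order-42"))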

Problems:

  • Blocking: participants hold locks while waiting for the coordinator's decision
  • Single point of failure: if the coordinator crashes, participants remain blocked
  • Data inconsistency: if the coordinator crashes or is partitioned away after sending Commit to only some participants, their states can diverge

6.2 Saga Pattern

Splits a large transaction into a series of local transactions, each with a corresponding compensating action.

T1 → T2 → T3 → ... → Tn
                      ↓ failure at Tn
Cn-1 → ... → C3 → C2 → C1   (compensation, executed in reverse order)

Example: order flow

1. Create order (T1)          Compensate: cancel order (C1)
2. Deduct inventory (T2)      Compensate: restore inventory (C2)  
3. Charge payment (T3)        Compensate: refund (C3)
4. Send notification (T4)     Compensate: send cancellation notice (C4)

If T3 fails: execute C2 → C1

Two orchestration approaches:

Approach        Implementation                               Advantages        Disadvantages
Choreography    Event-driven; services react to each other   Loose coupling    Flow is hard to trace
Orchestration   A central orchestrator drives the flow       Flow is explicit  Orchestrator adds complexity
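A minimal orchestration-style sketch for the order flow above: each step is paired with a compensating action, and on failure the already-completed steps are compensated in reverse order. The step functions are illustrative placeholders.

def run_saga(steps):
    """steps: list of (action, compensation) pairs."""
    done = []
    try:
        for action, compensation in steps:
            action()
            done.append(compensation)
    except Exception as e:
        # Compensate completed steps in reverse order.
        for compensation in reversed(done):
            compensation()
        return f"rolled back: {e}"
    return "completed"

def create_order():    print("T1: create order")
def cancel_order():    print("C1: cancel order")
def deduct_stock():    print("T2: deduct inventory")
def restore_stock():   print("C2: restore inventory")
def charge_payment():  raise RuntimeError("payment declined")   # T3 fails
def refund_payment():  print("C3: refund")

print(run_saga([(create_order, cancel_order),
                (deduct_stock, restore_stock),
                (charge_payment, refund_payment)]))
# Output: T1, T2 run, then C2 and C1 run as compensation.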

7. Distributed System Design Principles

Principle              Meaning
Idempotency            Executing the same operation multiple times has the same effect as executing it once
Timeouts and retries   Set reasonable timeouts, retry with exponential backoff
Graceful degradation   Core functionality keeps working when some features are unavailable
Observability          Logging, metrics, and tracing
Eventual consistency   Accept short-term inconsistency, guarantee eventual convergence
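A small sketch of the idempotency principle from the table above, deduplicating by a client-supplied request ID; the in-memory store and names are illustrative (production systems typically keep idempotency keys in a database or cache with a TTL).

processed = {}   # request_id -> result (acts as an idempotency-key store)

def transfer(request_id, account, amount):
    # Retries with the same request_id return the stored result
    # instead of applying the debit twice.
    if request_id in processed:
        return processed[request_id]
    result = f"debited {amount} from {account}"
    processed[request_id] = result
    return result

print(transfer("req-123", "acct-1", 50))   # applies the operation
print(transfer("req-123", "acct-1", 50))   # retry: same effect as once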

Relations to Other Topics

  • See System Design for how service decomposition, caching, messaging, and capacity planning connect to distributed scenarios
  • See Database Systems for the connection between replication, transactions, consistency, and storage design
  • See Cloud Services for the infrastructure layer that supports elasticity and resilience
  • See Computer Networks for how latency, protocol behavior, and transport reliability shape distributed systems

References

  • "Designing Data-Intensive Applications" - Martin Kleppmann
  • "Distributed Systems" - Maarten van Steen & Andrew Tanenbaum
  • Raft paper: "In Search of an Understandable Consensus Algorithm"
  • Kafka Official Documentation
