Fault Tolerance and Replication
Fault tolerance and replication are the techniques by which distributed systems continue to provide correct service despite the failure of some of their components.
Definition
Fault tolerance is the ability of a system to keep meeting its specification despite faults in some components; replication—maintaining multiple copies of computation or data—is the principal mechanism for achieving it, requiring protocols to keep the copies suitably consistent.
Scope
This area covers redundancy as the basis of fault tolerance, the state-machine replication approach to building reliable services, data replication and the consistency models that govern replicated data, the CAP theorem and the consistency-availability trade-off under partitions, and rollback-recovery techniques based on checkpointing and logging. It connects the theory of consensus and ordering to the construction of dependable systems.
Sub-topics
Core questions
- How does redundancy turn unreliable components into a reliable service?
- What consistency must replicas maintain, and what does each level cost in latency and availability?
- What is fundamentally impossible to guarantee when the network can partition?
- How can a system recover to a consistent state after failures?
Key theories
- State-machine replication
- A deterministic service is made fault-tolerant by running identical replicas that process the same sequence of commands in the same order, so that surviving replicas can mask the failure of others.
- CAP theorem
- When the network can partition, a replicated service cannot simultaneously guarantee strong consistency and availability; designers must choose which to sacrifice during a partition, a trade-off formalized by Gilbert and Lynch.
- Rollback recovery
- By periodically checkpointing state and optionally logging messages, a system can roll failed processes back to a consistent recovery line and replay forward, recovering without restarting the whole computation.
Clinical relevance
Replication and fault tolerance are what make cloud storage durable and services highly available; the consistency models and CAP trade-offs studied here directly determine the guarantees offered by databases, object stores, and coordination services used everywhere in production.
History
Schneider's 1990 tutorial codified the state-machine approach to replication; Brewer's CAP conjecture (2000), proved by Gilbert and Lynch in 2002, framed the consistency-availability debate that shaped the NoSQL era; and surveys such as Elnozahy and colleagues' consolidated decades of rollback-recovery research.
Debates
- Strong versus eventual consistency
- Strong consistency simplifies application reasoning but limits availability and raises latency under partitions; eventual consistency maximizes availability at the cost of exposing temporary disagreement, and the right choice depends on application semantics.
Key figures
- Fred Schneider
- Leslie Lamport
- Eric Brewer
- Seth Gilbert
- Nancy Lynch
Related topics
Seminal works
- schneider1990
- gilbert2002
- elnozahy2002
Frequently asked questions
- Does replication always improve reliability?
- Only if the replicas are kept consistent and fail independently. Poorly coordinated replicas can diverge and serve conflicting data, and correlated failures (shared power, software bugs) defeat redundancy, so replication must be paired with the right consistency protocol.