ScholarGate
Assistant

Fault Tolerance and Replication

Fault tolerance and replication are the techniques by which distributed systems continue to provide correct service despite the failure of some of their components.

Definition

Fault tolerance is the ability of a system to keep meeting its specification despite faults in some components; replication—maintaining multiple copies of computation or data—is the principal mechanism for achieving it, requiring protocols to keep the copies suitably consistent.

Scope

This area covers redundancy as the basis of fault tolerance, the state-machine replication approach to building reliable services, data replication and the consistency models that govern replicated data, the CAP theorem and the consistency-availability trade-off under partitions, and rollback-recovery techniques based on checkpointing and logging. It connects the theory of consensus and ordering to the construction of dependable systems.

Sub-topics

Core questions

  • How does redundancy turn unreliable components into a reliable service?
  • What consistency must replicas maintain, and what does each level cost in latency and availability?
  • What is fundamentally impossible to guarantee when the network can partition?
  • How can a system recover to a consistent state after failures?

Key theories

State-machine replication
A deterministic service is made fault-tolerant by running identical replicas that process the same sequence of commands in the same order, so that surviving replicas can mask the failure of others.
CAP theorem
When the network can partition, a replicated service cannot simultaneously guarantee strong consistency and availability; designers must choose which to sacrifice during a partition, a trade-off formalized by Gilbert and Lynch.
Rollback recovery
By periodically checkpointing state and optionally logging messages, a system can roll failed processes back to a consistent recovery line and replay forward, recovering without restarting the whole computation.

Clinical relevance

Replication and fault tolerance are what make cloud storage durable and services highly available; the consistency models and CAP trade-offs studied here directly determine the guarantees offered by databases, object stores, and coordination services used everywhere in production.

History

Schneider's 1990 tutorial codified the state-machine approach to replication; Brewer's CAP conjecture (2000), proved by Gilbert and Lynch in 2002, framed the consistency-availability debate that shaped the NoSQL era; and surveys such as Elnozahy and colleagues' consolidated decades of rollback-recovery research.

Debates

Strong versus eventual consistency
Strong consistency simplifies application reasoning but limits availability and raises latency under partitions; eventual consistency maximizes availability at the cost of exposing temporary disagreement, and the right choice depends on application semantics.

Key figures

  • Fred Schneider
  • Leslie Lamport
  • Eric Brewer
  • Seth Gilbert
  • Nancy Lynch

Related topics

Seminal works

  • schneider1990
  • gilbert2002
  • elnozahy2002

Frequently asked questions

Does replication always improve reliability?
Only if the replicas are kept consistent and fail independently. Poorly coordinated replicas can diverge and serve conflicting data, and correlated failures (shared power, software bugs) defeat redundancy, so replication must be paired with the right consistency protocol.

Methods for this concept

Related concepts