Consensus and Coordination
Consensus and coordination address how independent processes that communicate only by messages can agree on a common value or action despite delays and failures.
Definition
Consensus is the problem of having a set of processes, each proposing a value, all decide on a single common value such that the decision is valid, all correct processes agree, and every correct process eventually decides, even when some processes fail.
Scope
This area covers the consensus problem and its variants (agreement, atomic broadcast, validity, termination), the foundational FLP impossibility result for deterministic asynchronous consensus, practical consensus protocols such as Paxos and Raft, Byzantine fault-tolerant agreement, and the classical coordination problems of mutual exclusion and leader election. It is the theoretical and practical core of fault-tolerant distributed computing.
Sub-topics
Core questions
- Under what timing and failure models is consensus solvable, and when is it provably impossible?
- How do practical protocols circumvent the FLP impossibility while preserving safety?
- How much replication is required to tolerate crash versus Byzantine failures?
- How can processes coordinate access to shared resources or elect a leader reliably?
Key theories
- FLP impossibility
- In a fully asynchronous system, no deterministic protocol can guarantee consensus if even a single process may crash, because a slow process cannot be distinguished from a failed one; this result motivates partial synchrony, randomization, and failure detectors.
- Quorum-based consensus
- Protocols such as Paxos achieve agreement by requiring overlapping majorities (quorums) to accept proposals, ensuring that any two decisions intersect at a correct process and thereby remain consistent across rounds and leader changes.
- Byzantine agreement bounds
- To reach agreement when up to f processes may behave arbitrarily, at least 3f+1 processes are required in the classic synchronous setting, a tight bound that shapes the design of all Byzantine fault-tolerant systems.
Clinical relevance
Consensus is the backbone of reliable infrastructure: coordination services, replicated databases, distributed locks, and blockchains all run a consensus protocol to keep replicas consistent, making this area directly responsible for the durability and availability guarantees of modern cloud systems.
History
The 1982 Byzantine generals paper and the 1985 FLP impossibility result framed the limits of agreement; Lamport's Paxos (circulated in 1989, published in 1998) gave a practical asynchronous-safe protocol, and later work on Byzantine fault tolerance and Raft made consensus both robust against malicious faults and accessible to implementers.
Debates
- Does the FLP impossibility make consensus unattainable in practice?
- FLP rules out a deterministic protocol that is always both safe and live in a fully asynchronous model, but practical systems sidestep it by assuming eventual synchrony or using randomization, keeping safety unconditional and achieving liveness whenever the network behaves.
Key figures
- Leslie Lamport
- Nancy Lynch
- Michael Fischer
- Michael Paterson
- Barbara Liskov
Related topics
Seminal works
- fischer1985
- lamport1998
- lamport1982byz
Frequently asked questions
- Why is consensus considered the hardest problem in distributed systems?
- Because many other coordination tasks—atomic broadcast, replicated state machines, distributed locking—reduce to consensus, and consensus itself is provably impossible to solve deterministically in the asynchronous crash-failure model, so any solution must make careful timing or randomness assumptions.
- How does consensus relate to replicated databases?
- A replicated database keeps its replicas consistent by agreeing on the order of operations, which is exactly a sequence of consensus decisions; this is why systems like distributed key-value stores embed a consensus protocol at their core.