State Machine Replication
State machine replication makes a service fault-tolerant by running deterministic replicas that execute the same commands in the same order, so survivors mask the failure of others.
Definition
State machine replication is a technique in which a service is modeled as a deterministic state machine and replicated across several servers that all apply the same ordered sequence of client commands, ensuring that non-faulty replicas remain in identical states and can substitute for failed ones.
Scope
This topic covers the state-machine model of a service, the requirement that replicas be deterministic, the use of total-order (atomic) broadcast or consensus to agree on the command sequence, and the contrast with primary-backup (passive) replication. It also covers handling non-determinism, replica reconfiguration, and the extension to Byzantine replicas.
Core questions
- Why must replicas be deterministic and process commands in an identical order?
- How is the agreed command order produced from consensus or atomic broadcast?
- How does active (state-machine) replication differ from passive primary-backup replication?
Key theories
- The state-machine approach
- Any deterministic service can be made fault-tolerant by replicating it and feeding every replica the same ordered command stream; provided fewer than the tolerated number of replicas fail, the service remains correct and available.
- Ordering via consensus
- The required total order on commands is exactly an atomic broadcast, which is equivalent to consensus, so replicated state machines are typically built on a consensus protocol such as Paxos or Raft.
- Byzantine state-machine replication
- Extending the approach to tolerate arbitrary faults requires agreement among more than two-thirds of replicas and result voting by clients, as realized in practical Byzantine fault-tolerant replication.
Clinical relevance
State machine replication is the standard way to build a highly available, strongly consistent service; replicated logs, coordination services, and many distributed databases are state machines driven by a consensus-ordered command stream.
History
Lamport's logical-clock paper sketched the replicated-state-machine idea in 1978; Schneider's 1990 tutorial made it the canonical framework for fault-tolerant services; and Castro and Liskov's 1999 PBFT extended it to the Byzantine setting, a line that later influenced blockchain designs.
Debates
- Active versus passive replication
- Active (state-machine) replication executes commands at all replicas for fast failover but wastes work and requires determinism; passive primary-backup replication executes only at a primary and ships state, saving work but incurring failover delay and a single execution point.
Key figures
- Fred Schneider
- Leslie Lamport
- Miguel Castro
- Barbara Liskov
Related topics
Seminal works
- schneider1990
- lamport1998
- castro1999
Frequently asked questions
- Why must the replicated service be deterministic?
- If replicas could make different choices on the same input—because of randomness, timing, or thread scheduling—their states would diverge despite seeing identical commands. Determinism guarantees that the same ordered command stream produces the same state everywhere.