Byzantine Fault Tolerance
Byzantine fault tolerance is the study of reaching agreement when some processes may fail arbitrarily, including by sending conflicting or malicious messages.
Definition
A Byzantine fault is an arbitrary deviation from a component's specification, including malicious behavior; Byzantine fault tolerance is the ability of a distributed protocol to satisfy its correctness conditions despite a bounded number of such faulty participants.
Scope
This topic covers the Byzantine generals problem and the tight resilience bounds it establishes (at least 3f+1 processes, or signed messages, to tolerate f arbitrary faults), the synchronous oral- and signed-message agreement algorithms, and practical asynchronous Byzantine fault-tolerant protocols such as PBFT and their descendants. It underlies the trust models of permissioned blockchains and high-integrity replicated services.
Core questions
- How many correct processes are needed to tolerate a given number of arbitrary faults?
- How do digital signatures change the resilience bounds for Byzantine agreement?
- How can Byzantine agreement be made efficient enough for practical asynchronous systems?
Key theories
- Byzantine generals bound
- With unauthenticated (oral) messages, Byzantine agreement is solvable if and only if more than two-thirds of the processes are correct (n > 3f); authenticated messages with unforgeable signatures relax this requirement.
- Practical BFT replication
- PBFT shows that Byzantine agreement can run in a partially synchronous system with state-machine replication using a primary and a three-phase protocol, achieving practical performance while tolerating up to one-third faulty replicas.
- Synchronous agreement algorithms
- In the synchronous model, recursive oral-message and signed-message algorithms achieve agreement in f+1 rounds, illustrating the round-complexity cost of tolerating arbitrary faults.
Clinical relevance
Byzantine fault tolerance is the foundation of systems that must withstand not just crashes but compromise or malicious behavior, including permissioned blockchains, high-assurance replicated services, and avionics and aerospace control where arbitrary component failure cannot be ruled out.
History
Pease, Shostak, and Lamport established the agreement bounds in 1980 and dramatized them as the Byzantine generals problem in 1982; for nearly two decades Byzantine agreement was considered too costly for practice until Castro and Liskov's 1999 PBFT demonstrated efficient asynchronous Byzantine replication, reviving the field and later informing blockchain consensus.
Key figures
- Leslie Lamport
- Robert Shostak
- Marshall Pease
- Miguel Castro
- Barbara Liskov
Related topics
Seminal works
- lamport1982byz
- castro1999
- pease1980
Frequently asked questions
- Why does tolerating f Byzantine faults require 3f+1 processes?
- Correct processes must reach a majority decision even when f faulty processes lie and another f may be slow or unreachable; only with more than three times the number of faulty processes can the correct majority always outvote the combination of faulty and lagging participants.