Why can't reliable failure detection be implemented in a purely asynchronous system?

Without bounds on message delay, an arbitrarily slow but live process is indistinguishable from a crashed one, so any detector must sometimes be wrong. This is why asynchronous systems are augmented with timing assumptions or unreliable detectors.

Timing and Failure Models

Timing and failure models specify what a distributed algorithm may assume about message delays and processor speeds, and how components are permitted to fail.

Pronađite temu uz PaperMindUskoroFind papers & topics

Tools & resources

Preuzmi slajdove

Learn & explore

VideoUskoro

Definition

A timing model fixes the assumptions about upper bounds on message delivery time and relative process speed, while a failure model fixes the set of ways in which processes and channels may deviate from their specified behavior.

Scope

This topic covers the synchronous, asynchronous, and partially synchronous timing models; the taxonomy of failures from crash (fail-stop) through omission and timing to arbitrary (Byzantine) faults; and the abstraction of failure detectors that bridges asynchronous systems and timeout-based reasoning. These models are the axioms from which both possibility and impossibility results are derived.

Core questions

What bounds on delay and speed may an algorithm assume, and how do timeouts depend on them?
Which failure classes—crash, omission, timing, Byzantine—must a protocol mask?
How can asynchronous systems be augmented with failure detectors to circumvent impossibility results?

Key theories

Partial synchrony: Real systems are neither fully synchronous nor fully asynchronous; the partially synchronous model assumes bounds on delay and speed that hold eventually or are unknown, which is sufficient to solve consensus while remaining realistic.
Failure-model hierarchy: Failures range from benign fail-stop crashes, through send/receive omissions and timing violations, to arbitrary Byzantine behavior; the severity of the failures a protocol must tolerate dictates the required replication factor and message complexity.
Unreliable failure detectors: An abstract failure detector provides possibly incorrect hints about which processes have crashed; characterizing the weakest detector sufficient to solve consensus reconciles asynchronous rigor with practical timeout-based implementations.

Clinical relevance

Production systems implicitly pick a timing and failure model whenever they set timeouts, choose a replication factor, or decide whether to defend against malicious participants; getting these assumptions wrong is a common root cause of split-brain and data-loss incidents.

History

After the asynchronous model was shown too weak for fault-tolerant consensus, Dwork, Lynch, and Stockmeyer introduced partial synchrony in 1988, and Chandra and Toueg formalized unreliable failure detectors in 1996, together providing the modeling tools that make practical fault-tolerant agreement possible.

Debates

Are timeouts a timing assumption or a failure detector?: One view treats timeouts as encoding an (eventual) synchrony bound; another treats them as an implementation of an abstract failure detector. The two framings are largely equivalent but emphasize different design boundaries between the network model and the algorithm.

Key figures

Cynthia Dwork
Nancy Lynch
Larry Stockmeyer
Tushar Chandra
Sam Toueg

Seminal works

dwork1988
chandra1996
lynch1996

Frequently asked questions

Why can't reliable failure detection be implemented in a purely asynchronous system?: Without bounds on message delay, an arbitrarily slow but live process is indistinguishable from a crashed one, so any detector must sometimes be wrong. This is why asynchronous systems are augmented with timing assumptions or unreliable detectors.