ScholarGate
Asistent

Timing and Failure Models

Timing and failure models specify what a distributed algorithm may assume about message delays and processor speeds, and how components are permitted to fail.

Pronađite temu uz PaperMindUskoroFind papers & topics
Tools & resources
Preuzmi slajdove
Learn & explore
VideoUskoro

Definition

A timing model fixes the assumptions about upper bounds on message delivery time and relative process speed, while a failure model fixes the set of ways in which processes and channels may deviate from their specified behavior.

Scope

This topic covers the synchronous, asynchronous, and partially synchronous timing models; the taxonomy of failures from crash (fail-stop) through omission and timing to arbitrary (Byzantine) faults; and the abstraction of failure detectors that bridges asynchronous systems and timeout-based reasoning. These models are the axioms from which both possibility and impossibility results are derived.

Core questions

  • What bounds on delay and speed may an algorithm assume, and how do timeouts depend on them?
  • Which failure classes—crash, omission, timing, Byzantine—must a protocol mask?
  • How can asynchronous systems be augmented with failure detectors to circumvent impossibility results?

Key theories

Partial synchrony
Real systems are neither fully synchronous nor fully asynchronous; the partially synchronous model assumes bounds on delay and speed that hold eventually or are unknown, which is sufficient to solve consensus while remaining realistic.
Failure-model hierarchy
Failures range from benign fail-stop crashes, through send/receive omissions and timing violations, to arbitrary Byzantine behavior; the severity of the failures a protocol must tolerate dictates the required replication factor and message complexity.
Unreliable failure detectors
An abstract failure detector provides possibly incorrect hints about which processes have crashed; characterizing the weakest detector sufficient to solve consensus reconciles asynchronous rigor with practical timeout-based implementations.

Clinical relevance

Production systems implicitly pick a timing and failure model whenever they set timeouts, choose a replication factor, or decide whether to defend against malicious participants; getting these assumptions wrong is a common root cause of split-brain and data-loss incidents.

History

After the asynchronous model was shown too weak for fault-tolerant consensus, Dwork, Lynch, and Stockmeyer introduced partial synchrony in 1988, and Chandra and Toueg formalized unreliable failure detectors in 1996, together providing the modeling tools that make practical fault-tolerant agreement possible.

Debates

Are timeouts a timing assumption or a failure detector?
One view treats timeouts as encoding an (eventual) synchrony bound; another treats them as an implementation of an abstract failure detector. The two framings are largely equivalent but emphasize different design boundaries between the network model and the algorithm.

Key figures

  • Cynthia Dwork
  • Nancy Lynch
  • Larry Stockmeyer
  • Tushar Chandra
  • Sam Toueg

Related topics

Seminal works

  • dwork1988
  • chandra1996
  • lynch1996

Frequently asked questions

Why can't reliable failure detection be implemented in a purely asynchronous system?
Without bounds on message delay, an arbitrarily slow but live process is indistinguishable from a crashed one, so any detector must sometimes be wrong. This is why asynchronous systems are augmented with timing assumptions or unreliable detectors.

Methods for this concept

Related concepts