Distributed System Models

Distributed system models are the abstract assumptions—about architecture, timing, communication, and failures—that define what a distributed algorithm can rely on and what it must tolerate.

Definition

A distributed system is a collection of independent computers that communicate only by exchanging messages and that appears to its users as a single coherent system; a system model is the set of assumptions about processes, communication channels, timing, and failures under which such a system is analyzed.

Scope

This area covers the architectural and physical models of distributed systems (clients, servers, peers, and multitier organizations), the timing models that distinguish synchronous from asynchronous execution, the fundamental failure models (crash, omission, timing, and Byzantine), and the communication abstractions of message passing, shared memory, remote invocation, and middleware. These models frame every result in the field: an algorithm correct under one model may be impossible under another, so making the model explicit is a prerequisite for reasoning about correctness and performance.

Core questions

What assumptions about timing, communication, and failures does a given distributed algorithm require?
How do synchronous and asynchronous models differ, and why does the distinction change what is computable?
What classes of process and channel failures must a protocol tolerate to be correct?
When should a system be structured around message passing versus a shared-memory or remote-invocation abstraction?

Key theories

Synchronous versus asynchronous models: In a synchronous model there are known bounds on message delay and relative processor speed, allowing the use of timeouts to detect failures; in an asynchronous model no such bounds exist, which makes failure detection fundamentally unreliable and underlies many impossibility results.
Failure model hierarchy: Process and channel failures are classified from benign to severe: crash (including its detectable fail-stop variant), omission, timing, and arbitrary (Byzantine). Masking a more severe class requires stronger mechanisms, such as more replicas or more rounds of communication, so the model chosen determines both the achievable resilience and the cost of a protocol.
Communication abstractions: Distributed computation is built on a small set of interaction primitives—asynchronous and synchronous message passing, distributed shared memory, and remote procedure or method invocation—each with distinct semantics for delivery, ordering, and failure that shape higher-level design.
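
The synchronous versus asynchronous distinction above can be made concrete with a timeout-based failure detector. The following is a minimal sketch, not a production design: the class `TimeoutDetector`, its `bound` parameter, and the peer names are hypothetical, and timestamps are passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class TimeoutDetector:
    """Suspects a peer once no heartbeat has arrived within `bound` seconds."""

    # Assumed upper bound on heartbeat interval plus message delay.
    # Such a bound is only known in a synchronous model.
    bound: float
    last_seen: dict[str, float] = field(default_factory=dict)

    def heartbeat(self, peer: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from `peer`.
        self.last_seen[peer] = now

    def suspects(self, now: float) -> set[str]:
        # A peer is suspected when its silence exceeds the assumed bound.
        return {p for p, t in self.last_seen.items() if now - t > self.bound}


detector = TimeoutDetector(bound=2.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=1.5)

suspected = detector.suspects(now=3.0)
print(suspected)
```

If the bound really holds, as a synchronous model assumes, suspecting the silent peer is safe. In an asynchronous model no value of `bound` is guaranteed correct, so the same suspicion may simply mean the peer is slow.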

Practical relevance

Choosing the right model is the first design decision in any real system: cloud platforms, databases, and coordination services all declare an (often partially synchronous) timing model and a failure model, and these choices determine which consistency, availability, and fault-tolerance guarantees the system can promise.

History

Early distributed systems research in the 1970s and 1980s sought to identify the minimal assumptions under which distributed coordination is possible, producing the synchronous/asynchronous dichotomy and a taxonomy of failures. These models were consolidated in textbooks by Lynch, Attiya and Welch, Tanenbaum and van Steen, and Coulouris and colleagues, becoming the shared vocabulary for the entire field.

Debates

How realistic is the asynchronous model for practical systems? The pure asynchronous model makes the weakest timing assumptions, so it is the hardest to design algorithms for and rules out reliable failure detection. Most real networks, however, are only intermittently slow. Partially synchronous models and failure detectors emerged as a pragmatic middle ground that retains rigor while admitting timeouts.
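
One way to picture this middle ground is a failure detector whose timeout adapts: it may suspect a slow peer wrongly, but each mistake makes it more patient. The sketch below assumes a hypothetical `AdaptiveDetector` class with a multiplicative `backoff` factor; it illustrates the idea only and is not the detector of any particular system.

```python
class AdaptiveDetector:
    """Timeout-based detector that lengthens a peer's timeout after a false suspicion."""

    def __init__(self, initial_timeout: float, backoff: float = 2.0) -> None:
        self.initial_timeout = initial_timeout
        self.backoff = backoff  # assumed growth factor, chosen for illustration
        self.timeout: dict[str, float] = {}
        self.last_seen: dict[str, float] = {}
        self.suspected: set[str] = set()

    def heartbeat(self, peer: str, now: float) -> None:
        if peer in self.suspected:
            # The peer was only slow, not crashed: trust it again
            # and wait longer before suspecting it next time.
            self.suspected.discard(peer)
            self.timeout[peer] *= self.backoff
        self.timeout.setdefault(peer, self.initial_timeout)
        self.last_seen[peer] = now

    def check(self, now: float) -> set[str]:
        # Suspect every peer whose silence exceeds its current timeout.
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)


d = AdaptiveDetector(initial_timeout=1.0)
d.heartbeat("p", now=0.0)
first = d.check(now=1.5)    # "p" has been silent for longer than 1.0 s
d.heartbeat("p", now=1.8)   # late heartbeat: the suspicion was false
second = d.check(now=3.5)   # silent for 1.7 s, within the doubled timeout
print(first, second, d.timeout["p"])
```

If message delays eventually stay below some bound, as partially synchronous models assume, the timeout grows past that bound after finitely many mistakes and correct peers stop being suspected.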

Key figures

Leslie Lamport
Nancy Lynch
Andrew S. Tanenbaum
Maarten van Steen

Seminal works

lynch1996
tanenbaum2017
attiya2004

Frequently asked questions

Why does the timing model matter so much? Because it determines whether timeouts can be trusted. In a synchronous model, bounded delays let a process safely conclude that a silent peer has failed. In an asynchronous model, a slow process and a crashed one are indistinguishable, which is the root cause of several famous impossibility results, notably that deterministic consensus cannot be guaranteed in an asynchronous system if even one process may crash.
What is a Byzantine failure? A Byzantine (arbitrary) failure is one in which a faulty component may behave in any way at all, including sending conflicting or malicious messages. Tolerating it is far more expensive than tolerating simple crashes and requires specialized agreement protocols.
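
The cost difference between crash and Byzantine tolerance can be illustrated by the group size needed to survive f faulty replicas. The sketch below assumes two classic bounds: n ≥ 2f + 1 for majority-quorum protocols that tolerate crashes, and n ≥ 3f + 1 for Byzantine agreement without message signatures. The function name `min_replicas` is hypothetical, and other models (for example synchronous systems or signed messages) can have different requirements.

```python
def min_replicas(f: int, byzantine: bool) -> int:
    """Smallest group size tolerating f faulty replicas under the assumed bounds."""
    if f < 0:
        raise ValueError("f must be non-negative")
    # Assumption: crash faults need a majority of correct replicas (n >= 2f + 1);
    # Byzantine faults without signatures need n >= 3f + 1.
    return 3 * f + 1 if byzantine else 2 * f + 1


for f in (1, 2, 3):
    # Columns: f, replicas for crash faults, replicas for Byzantine faults.
    print(f, min_replicas(f, byzantine=False), min_replicas(f, byzantine=True))
```

Under these assumptions, tolerating a single faulty replica takes three replicas for crashes but four for Byzantine behavior, and the gap widens as f grows.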