Checkpointing and Recovery

Checkpointing periodically saves a system's state so that after a failure it can roll back to a consistent point and resume, rather than restarting from scratch.

Definition

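Checkpointing records the state of one or more processes to stable storage. After a failure, rollback recovery uses these checkpoints, possibly together with logged messages, to restore the system to a consistent global state and resume execution from there. The set of checkpoints that forms this state is called a recovery line. A global state is consistent when it contains no orphan message: every message it records as received is also recorded as sent.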
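To make the consistency condition concrete, here is a minimal sketch that checks whether a candidate set of checkpoints contains an orphan message, that is, a message recorded as received whose send is not recorded. The `Message` type, the use of per-process local event counts as timestamps, and the function names are assumptions made for illustration. They do not come from any particular system.

```python
from typing import NamedTuple

class Message(NamedTuple):
    sender: str
    send_time: int   # sender's local event count at the send
    receiver: str
    recv_time: int   # receiver's local event count at the receive

def orphan_messages(line, messages):
    """Messages recorded as received inside the line but sent outside it.

    `line` maps each process to the local event count its checkpoint covers
    (events with a count <= that value are part of the saved state).
    """
    return [m for m in messages
            if m.recv_time <= line[m.receiver] and m.send_time > line[m.sender]]

def is_consistent(line, messages):
    return not orphan_messages(line, messages)

messages = [Message("P", 2, "Q", 3)]
print(is_consistent({"P": 1, "Q": 4}, messages))  # False: Q recorded a message P never sent
print(is_consistent({"P": 2, "Q": 4}, messages))  # True: send and receive are both recorded
print(is_consistent({"P": 3, "Q": 2}, messages))  # True: the message is merely in transit
```

The third case shows that a message sent but not yet received does not break consistency. It only means the message has to be re-sent or restored from a log during recovery.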

Scope

This topic covers checkpoint-based and log-based rollback recovery: uncoordinated, coordinated, and communication-induced checkpointing; the domino effect that uncoordinated checkpoints can cause; and pessimistic, optimistic, and causal message logging that allow recovery beyond the last checkpoint. It connects to the consistent-cut theory of global snapshots.

Core questions

How can checkpoints across processes be combined into a consistent recovery line?
What is the domino effect and how does coordination prevent it?
When does message logging allow recovery past the most recent checkpoint?

Key theories

Coordinated checkpointing: Processes coordinate so that their checkpoints together form a consistent global state, guaranteeing a usable recovery line and avoiding cascading rollbacks at the cost of synchronization overhead.
Uncoordinated checkpointing and the domino effect: If processes checkpoint independently, recovery may require rolling each back to find a consistent set, potentially cascading all the way to the start (the domino effect), which coordination or logging is designed to avoid.
Message logging: Assuming a process executes deterministically between message receipts, logging the messages it receives (pessimistically, optimistically, or causally) lets it replay them in their original order after a failure and advance past its last checkpoint, recovering recent work without a global rollback.
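The message-logging idea can be illustrated with a small sketch of pessimistic, receiver-side logging: each message is appended to a log before it is delivered, so a recovering process can restore its checkpoint and replay the messages that arrived afterwards. The class `LoggedProcess`, its toy state-update rule, and the in-memory list standing in for stable storage are all assumptions chosen for illustration.

```python
class LoggedProcess:
    def __init__(self):
        self.state = 0            # application state (volatile)
        self.delivered = 0        # number of messages applied so far (volatile)
        self.log = []             # stands in for stable storage; survives a crash
        self.checkpoint = (0, 0)  # saved (state, delivered); also on stable storage

    def _apply(self, msg):
        # Deterministic and order-sensitive, so replay must preserve order.
        self.state = self.state * 2 + msg
        self.delivered += 1

    def receive(self, msg):
        self.log.append(msg)      # pessimistic: log first, then deliver
        self._apply(msg)

    def take_checkpoint(self):
        self.checkpoint = (self.state, self.delivered)

    def crash_and_recover(self):
        # Volatile state is lost; restore the checkpoint, then replay the
        # logged messages that were delivered after it was taken.
        self.state, self.delivered = self.checkpoint
        for msg in self.log[self.delivered:]:
            self._apply(msg)

p = LoggedProcess()
p.receive(5)
p.receive(7)
p.take_checkpoint()       # checkpoint holds state 17
p.receive(11)
p.receive(13)             # state is now 103
p.crash_and_recover()
print(p.state)            # 103: work done after the checkpoint is recovered
```

Because the handler is deterministic, replaying the same messages in the same order reproduces the pre-failure state, so no other process has to roll back.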

Practical relevance

Checkpoint/restart keeps long-running high-performance and scientific computations resilient to node failures, and asynchronous checkpointing underlies the exactly-once state guarantees that modern stream-processing systems such as Apache Flink provide after a failure.

History

Building on Chandy and Lamport's consistent-snapshot theory, Koo and Toueg formalized coordinated checkpointing in 1987, and decades of work on logging and uncoordinated schemes were consolidated in Elnozahy and colleagues' 2002 survey, the standard reference on rollback recovery.

Debates

Coordinated versus uncoordinated checkpointing: Coordinated checkpointing guarantees a clean recovery line but adds synchronization cost and global coordination; uncoordinated checkpointing is cheaper at checkpoint time but risks the domino effect and complex recovery, so the right choice depends on failure rate and scale.
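The synchronization cost on the coordinated side can be seen in a deliberately simplified sketch of a blocking two-phase scheme: every process first takes a tentative checkpoint and stops sending, and the checkpoints become permanent only if all processes agree. This is an illustrative simplification, not the Koo and Toueg algorithm itself. The `Participant` class, the `healthy` flag, and the function name are hypothetical.

```python
class Participant:
    def __init__(self, name, state=0, healthy=True):
        self.name = name
        self.state = state
        self.healthy = healthy   # hypothetical: whether this process can checkpoint
        self.tentative = None
        self.permanent = None    # last committed checkpoint
        self.blocked = False     # while True, no application messages are sent

    def prepare(self):
        if not self.healthy:
            return False         # vote "no"
        self.tentative = self.state
        self.blocked = True      # blocking is the synchronization overhead
        return True

    def commit(self):
        self.permanent = self.tentative
        self.tentative = None
        self.blocked = False

    def abort(self):
        self.tentative = None
        self.blocked = False

def coordinated_checkpoint(participants):
    """Phase 1: collect votes. Phase 2: commit everywhere or abort everywhere."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False

group = [Participant("P", state=5), Participant("Q", state=9)]
print(coordinated_checkpoint(group))   # True
print([p.permanent for p in group])    # [5, 9]
```

Because no process sends application messages between the two phases, no message can cross the checkpoint line in the wrong direction, which is what makes the committed set a usable recovery line in this sketch.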

Key figures

K. Mani Chandy
Leslie Lamport
Sam Toueg
Lorenzo Alvisi

Seminal works

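Elnozahy, Alvisi, Wang, and Johnson (2002), "A Survey of Rollback-Recovery Protocols in Message-Passing Systems," ACM Computing Surveys.
Koo and Toueg (1987), "Checkpointing and Rollback-Recovery for Distributed Systems," IEEE Transactions on Software Engineering.
Chandy and Lamport (1985), "Distributed Snapshots: Determining Global States of Distributed Systems," ACM Transactions on Computer Systems.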

Frequently asked questions

What is the domino effect in rollback recovery? When processes checkpoint without coordination, rolling one process back can force a dependent process to roll back too, and this can cascade backward through the whole computation, potentially to the very beginning. Coordinated checkpointing and message logging are used to prevent it.
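The cascade can be reproduced with a small rollback-propagation sketch. The failed process returns to its latest checkpoint, and any process that has received a message whose send was undone must return to a checkpoint taken before that receive, repeating until no such orphan message remains. The `Message` type, the local-event-count timestamps, and the example schedule are assumptions made for illustration.

```python
import math
from typing import NamedTuple

class Message(NamedTuple):
    sender: str
    send_time: int
    receiver: str
    recv_time: int

def recovery_line(checkpoints, messages, failed):
    """Return the local time each process is restored to.

    `checkpoints[p]` lists p's checkpoint times and must include 0 (the
    initial state). math.inf means the process keeps its current state.
    """
    line = {p: math.inf for p in checkpoints}
    line[failed] = max(checkpoints[failed])
    changed = True
    while changed:
        changed = False
        for m in messages:
            orphan = m.recv_time <= line[m.receiver] and m.send_time > line[m.sender]
            if orphan:
                # Roll the receiver back to its latest checkpoint before the receive.
                line[m.receiver] = max(c for c in checkpoints[m.receiver]
                                       if c < m.recv_time)
                changed = True
    return line

checkpoints = {"P": [0, 4, 10], "Q": [0, 7, 13]}
messages = [
    Message("Q", 1, "P", 2),
    Message("P", 5, "Q", 6),
    Message("Q", 8, "P", 9),
    Message("P", 11, "Q", 12),
]
print(recovery_line(checkpoints, messages, failed="P"))      # {'P': 0, 'Q': 0}
# Without the message that crosses back from Q to P, the cascade stops early.
print(recovery_line(checkpoints, messages[:2] + messages[3:], failed="P"))  # {'P': 10, 'Q': 7}
```

In the first run, every checkpoint is crossed by a message sent after the other process's previous checkpoint, so each rollback orphans another message and both processes end up at their initial state.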