ScholarGate

Checkpointing and Recovery

Checkpointing periodically saves a system's state so that after a failure it can roll back to a consistent point and resume, rather than restarting from scratch.

Definition

Checkpointing records the state of one or more processes to stable storage. Rollback recovery uses these checkpoints, possibly together with logged messages, to restore the system to a consistent global state (a recovery line) after a failure and then replay forward from there.
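
As a minimal single-process sketch of this definition, the following Python example saves state to a file and resumes from it after a simulated failure. The function names (save_checkpoint, load_checkpoint, run), the JSON format, and the toy summation workload are all assumptions made for illustration. A local file stands in for stable storage, and the write-then-rename pattern is one common way to avoid leaving a half-written checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state atomically: a crash leaves the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())      # push the bytes to disk before publishing
        os.replace(tmp_path, path)    # atomic rename over the previous checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every, crash_at=None):
    """Sum 0..total_steps-1, resuming from the checkpoint at path if one exists."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["acc"]
```

Calling run again with the same path after the simulated failure redoes only the steps taken since the last checkpoint, not the whole computation.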

Scope

This topic covers checkpoint-based and log-based rollback recovery: uncoordinated, coordinated, and communication-induced checkpointing; the domino effect that uncoordinated checkpoints can cause; and pessimistic, optimistic, and causal message logging that allow recovery beyond the last checkpoint. It connects to the consistent-cut theory of global snapshots.

Core questions

  • How can checkpoints across processes be combined into a consistent recovery line?
  • What is the domino effect and how does coordination prevent it?
  • When does message logging allow recovery past the most recent checkpoint?

Key theories

Coordinated checkpointing
Processes coordinate so that their checkpoints together form a consistent global state, guaranteeing a usable recovery line and avoiding cascading rollbacks at the cost of synchronization overhead.
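
The idea can be sketched with a simplified, blocking two-phase scheme in Python. This is an illustrative toy, not the Koo and Toueg protocol itself: the Process class, the on_tentative hook, and the in-memory inboxes are assumptions made for the example. Each process stops sending application messages between its tentative checkpoint and the commit, so no checkpoint can record the receipt of a message whose send is missing from the sender's checkpoint (an orphan message).

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.sent = set()        # ids of messages sent so far
        self.received = set()    # ids of messages delivered so far
        self.inbox = []          # messages in transit to this process
        self.blocked = False     # True between tentative checkpoint and commit
        self.tentative = None
        self.checkpoint = None

    def send(self, msg_id, dest):
        if self.blocked:
            return False         # application sends are deferred until commit
        self.sent.add(msg_id)
        dest.inbox.append(msg_id)
        return True

    def deliver_all(self):
        self.received.update(self.inbox)
        self.inbox.clear()

    def take_tentative(self):
        self.tentative = {"sent": set(self.sent), "received": set(self.received)}
        self.blocked = True

    def commit(self):
        self.checkpoint, self.tentative = self.tentative, None
        self.blocked = False


def coordinated_checkpoint(processes, on_tentative=None):
    for p in processes:          # phase 1: everyone takes a tentative checkpoint
        p.take_tentative()
        if on_tentative is not None:
            on_tentative(p)      # hook used below to inject traffic mid-protocol
    for p in processes:          # phase 2: everyone commits and unblocks
        p.commit()


def independent_checkpoint(p):
    p.take_tentative()
    p.commit()                   # commits immediately, with no coordination


def is_consistent(processes):
    """No orphans: every receipt recorded in a checkpoint has its send recorded too."""
    all_sent = set().union(*(p.checkpoint["sent"] for p in processes))
    return all(p.checkpoint["received"] <= all_sent for p in processes)


def demo(coordinated):
    p, q = Process("p"), Process("q")
    p.send("m1", q)
    q.deliver_all()

    def late_traffic(proc):
        # proc has just checkpointed and now tries to send a new message
        peer = q if proc is p else p
        proc.send("late-from-" + proc.name, peer)
        peer.deliver_all()

    if coordinated:
        coordinated_checkpoint([p, q], on_tentative=late_traffic)
    else:
        independent_checkpoint(p)
        late_traffic(p)              # p sends after its checkpoint, q delivers
        independent_checkpoint(q)    # q's checkpoint now records an orphan
    return is_consistent([p, q])
```

Here demo(True) yields a consistent set of checkpoints, while demo(False) does not. The sketch leaves out failure handling during the protocol and the treatment of messages still in transit at checkpoint time, both of which a real protocol has to address.
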
Uncoordinated checkpointing and the domino effect
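If processes checkpoint independently, their most recent checkpoints may not form a consistent global state. Recovery must then roll processes back to successively earlier checkpoints until a consistent set is found, and these rollbacks can cascade all the way to the start of the computation (the domino effect). Coordination and logging are designed to avoid this.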
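
The cascade can be made concrete with a small Python sketch that searches for a recovery line by rollback propagation. The data model is an assumption made for illustration: checkpoints are numbered from 0 (the initial state), interval i is the execution after checkpoint i, and each message is a tuple (sender, send_interval, receiver, receive_interval). Restoring checkpoint c undoes every interval numbered c or higher. The sketch only eliminates orphan messages, whose send is undone while their receipt is kept. Messages whose receipt is undone but whose send is kept are assumed to be handled separately, for example by retransmission.

```python
def recovery_line(latest, messages, failed):
    """Return, per process, the checkpoint index to restore after `failed` crashes.

    latest[p] is the index of p's most recent checkpoint. A result of
    latest[p] + 1 means that p keeps its current state and does not roll back.
    """
    line = {p: (c if p == failed else c + 1) for p, c in latest.items()}
    changed = True
    while changed:
        changed = False
        for sender, send_iv, receiver, recv_iv in messages:
            send_undone = send_iv >= line[sender]
            recv_kept = recv_iv < line[receiver]
            if send_undone and recv_kept:
                # Orphan message: roll the receiver back far enough to undo the receipt.
                line[receiver] = recv_iv
                changed = True
    return line


# Zigzag pattern: each message is sent after one checkpoint and received
# just before the peer's next checkpoint.
zigzag = [
    ("Q", 0, "P", 0),
    ("P", 1, "Q", 0),
    ("Q", 1, "P", 1),
    ("P", 2, "Q", 1),
]
domino = recovery_line({"P": 2, "Q": 2}, zigzag, failed="P")

# No message crosses a checkpoint boundary in the harmful direction.
benign = [("P", 0, "Q", 0), ("Q", 1, "P", 1)]
local = recovery_line({"P": 2, "Q": 2}, benign, failed="P")
```

In the zigzag case a single failure of P drives both processes back to checkpoint 0, so domino is {"P": 0, "Q": 0}. In the benign case only P rolls back, to its latest checkpoint, and Q keeps its current state, so local is {"P": 2, "Q": 3}.
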
Message logging
Logging the messages a process receives (pessimistically, optimistically, or causally) lets a recovering process replay them deterministically and advance past its last checkpoint, recovering recent work without global rollback.
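
A minimal sketch of the pessimistic variant in Python follows. The LoggedProcess class and its message format are hypothetical, and an in-memory list stands in for stable storage. The example relies on the usual assumption that the message handler is deterministic, so replaying the same messages in the same order reproduces the same state.

```python
class LoggedProcess:
    def __init__(self):
        self.state = 0
        self.stable_log = []       # stands in for stable storage, survives crashes
        self.checkpoint = (0, 0)   # (saved state, number of logged messages it reflects)

    def _apply(self, msg):
        # Deterministic handler: same messages in the same order give the same state.
        op, value = msg
        if op == "add":
            self.state += value
        elif op == "mul":
            self.state *= value
        else:
            raise ValueError(f"unknown operation: {op}")

    def receive(self, msg):
        self.stable_log.append(msg)   # pessimistic: log before processing
        self._apply(msg)

    def take_checkpoint(self):
        self.checkpoint = (self.state, len(self.stable_log))

    def crash(self):
        self.state = None             # volatile state is lost

    def recover(self):
        # Restore the checkpoint, then replay every message logged after it.
        self.state, applied = self.checkpoint
        for msg in self.stable_log[applied:]:
            self._apply(msg)


proc = LoggedProcess()
proc.receive(("add", 5))
proc.receive(("mul", 3))
proc.take_checkpoint()            # checkpoint holds state 15
proc.receive(("add", 2))
proc.receive(("mul", 2))          # state is now 34, past the checkpoint
proc.crash()
proc.recover()                    # replays the two messages logged after the checkpoint
```

After recover, the process is back at state 34, not at the checkpointed 15, and no other process had to roll back. Optimistic and causal logging relax the log-before-process rule to reduce overhead, at the cost of more complex recovery.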

Practical relevance

Checkpoint/restart keeps long-running high-performance and scientific computations resilient to node failures: after a failure, a job resumes from its last checkpoint instead of starting over. In modern stream-processing systems, asynchronous checkpointing underpins exactly-once guarantees after recovery from a fault.

History

Building on Chandy and Lamport's consistent-snapshot theory, Koo and Toueg formalized coordinated checkpointing in 1987, and decades of work on logging and uncoordinated schemes were consolidated in Elnozahy and colleagues' 2002 survey, the standard reference on rollback recovery.

Debates

Coordinated versus uncoordinated checkpointing
Coordinated checkpointing guarantees a clean recovery line but adds synchronization cost and global coordination; uncoordinated checkpointing is cheaper at checkpoint time but risks the domino effect and complex recovery, so the right choice depends on failure rate and scale.

Key figures

  • K. Mani Chandy
  • Leslie Lamport
  • Sam Toueg
  • Lorenzo Alvisi

Seminal works

  • elnozahy2002
  • koo1987
  • chandy1985

Frequently asked questions

What is the domino effect in rollback recovery?
When processes checkpoint without coordination, rolling one back can force a dependent process to roll back too, which can cascade backward through the whole computation—potentially to the very beginning. Coordinated checkpointing or message logging is used to prevent it.
