Checkpointing and Recovery
Checkpointing periodically saves a system's state so that after a failure it can roll back to a consistent point and resume, rather than restarting from scratch.
Definition
Checkpointing records the state of one or more processes to stable storage. Rollback recovery uses these checkpoints, possibly together with logged messages, to restore the system after a failure to a consistent global state (a recovery line) and resume execution from there.
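To make "consistent global state" concrete, the following is a minimal Python sketch of a consistency check. The names `Message` and `is_consistent` are hypothetical, and the model is an assumption made for illustration: each process numbers its local events from 1, and a checkpoint is represented by the number of local events it includes. A set of checkpoints is treated as consistent when no message is recorded as received without also being recorded as sent (no orphan message).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_index: int   # sender's local event number of the send
    receiver: str
    recv_index: int   # receiver's local event number of the receive


def is_consistent(cut, messages):
    """Return True if the cut contains no orphan message.

    `cut` maps each process name to the number of local events that
    its checkpoint includes (an illustrative encoding, not a standard one).
    """
    for m in messages:
        received = m.recv_index <= cut[m.receiver]
        sent = m.send_index <= cut[m.sender]
        if received and not sent:
            return False  # receive is recorded but the send is not
    return True


# P sends at its event 2; Q receives at its event 3.
messages = [Message("P", 2, "Q", 3)]

print(is_consistent({"P": 2, "Q": 3}, messages))  # send and receive both recorded
print(is_consistent({"P": 1, "Q": 3}, messages))  # orphan: received but never sent
print(is_consistent({"P": 2, "Q": 2}, messages))  # in transit: sent, not yet received
```

In this sketch an in-transit message does not make the cut inconsistent; whether such a message must be logged so it can be redelivered depends on the recovery protocol in use.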
Scope
This topic covers checkpoint-based and log-based rollback recovery: uncoordinated, coordinated, and communication-induced checkpointing; the domino effect that uncoordinated checkpoints can cause; and pessimistic, optimistic, and causal message logging that allow recovery beyond the last checkpoint. It connects to the consistent-cut theory of global snapshots.
Core questions
- How can checkpoints across processes be combined into a consistent recovery line?
- What is the domino effect and how does coordination prevent it?
- When does message logging allow recovery past the most recent checkpoint?
Key theories
- Coordinated checkpointing
- Processes coordinate so that their checkpoints together form a consistent global state, guaranteeing a usable recovery line and avoiding cascading rollbacks at the cost of synchronization overhead.
- Uncoordinated checkpointing and the domino effect
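- If processes checkpoint independently, their most recent checkpoints may not form a consistent global state. Recovery must then roll processes back to earlier checkpoints in search of a consistent set, and these rollbacks can cascade all the way to the start of the computation (the domino effect). Coordination and message logging are designed to avoid this.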
- Message logging
- Logging the messages a process receives (pessimistically, optimistically, or causally) lets a recovering process replay them in their original order. Provided the process behaves deterministically between message receipts, replay advances it past its last checkpoint, recovering recent work without global rollback.
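The message-logging idea above can be sketched in a few lines of Python. This is an illustrative toy, not a production design: the class `LoggedProcess` and its methods are hypothetical, in-memory lists stand in for stable storage, and the handler is assumed to be deterministic. It follows the pessimistic style, in which each message is logged before it is applied.

```python
import copy


class LoggedProcess:
    """Toy process with receiver-side pessimistic message logging."""

    def __init__(self):
        self.state = {"total": 0, "delivered": 0}   # volatile state
        self.stable_log = []                        # stand-in for stable storage
        self.stable_checkpoint = copy.deepcopy(self.state)

    def _apply(self, msg):
        # Deterministic handler: same state + same message -> same result.
        self.state["total"] += msg
        self.state["delivered"] += 1

    def deliver(self, msg):
        self.stable_log.append(msg)   # pessimistic: log first, then apply
        self._apply(msg)

    def checkpoint(self):
        self.stable_checkpoint = copy.deepcopy(self.state)

    def crash(self):
        self.state = None             # volatile state is lost

    def recover(self):
        # Restore the last checkpoint, then replay every logged message
        # that was delivered after that checkpoint was taken.
        self.state = copy.deepcopy(self.stable_checkpoint)
        for msg in self.stable_log[self.state["delivered"]:]:
            self._apply(msg)


p = LoggedProcess()
p.deliver(5)
p.deliver(7)
p.checkpoint()
p.deliver(11)                  # work done after the last checkpoint
before_crash = copy.deepcopy(p.state)

p.crash()
p.recover()
print(p.state == before_crash)  # the post-checkpoint work is recovered
```

Optimistic and causal logging relax the "log first" step to reduce overhead during failure-free operation, at the price of more involved recovery; the sketch does not model them.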
Practical relevance
Checkpoint/restart keeps long-running high-performance and scientific computations resilient to node failures, and asynchronous checkpointing underpins the exactly-once state-recovery guarantees of modern stream-processing systems.
History
Building on Chandy and Lamport's consistent-snapshot theory, Koo and Toueg formalized coordinated checkpointing in 1987. Decades of work on logging and uncoordinated schemes were later consolidated in the 2002 survey by Elnozahy and colleagues, the standard reference on rollback recovery.
Debates
- Coordinated versus uncoordinated checkpointing
- Coordinated checkpointing guarantees a clean recovery line but adds synchronization cost and global coordination; uncoordinated checkpointing is cheaper at checkpoint time but risks the domino effect and complex recovery, so the right choice depends on failure rate and scale.
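The synchronization cost on the coordinated side can be seen in a simplified two-phase sketch. This is a hedged illustration under strong assumptions, not Koo and Toueg's algorithm: the names `Participant` and `coordinated_checkpoint` are hypothetical, all processes take part in every round, and no application messages are exchanged while a round is in progress.

```python
class Participant:
    """Toy process that takes tentative checkpoints and later commits them."""

    def __init__(self, name, state, can_checkpoint=True):
        self.name = name
        self.state = state                  # immutable values only in this toy
        self.can_checkpoint = can_checkpoint
        self.tentative = None
        self.permanent = state              # initial state is the first checkpoint

    def prepare(self):
        # Phase 1: take a tentative checkpoint and vote.
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        return True

    def commit(self):
        # Phase 2 (success): the tentative checkpoint becomes permanent.
        self.permanent, self.tentative = self.tentative, None

    def abort(self):
        # Phase 2 (failure): discard the tentative checkpoint.
        self.tentative = None


def coordinated_checkpoint(participants):
    """Commit a new checkpoint everywhere, or nowhere."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


a, b = Participant("A", 0), Participant("B", 0)
a.state, b.state = 10, 20
print(coordinated_checkpoint([a, b]), a.permanent, b.permanent)

a.state, b.state = 11, 21
b.can_checkpoint = False        # B cannot take part in this round
print(coordinated_checkpoint([a, b]), a.permanent, b.permanent)
```

Because a round either commits at every participant or at none, the permanent checkpoints always belong to the same round, which is what makes them usable as a recovery line in this simplified setting. Every round, however, costs two exchanges with every participant.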
Key figures
- K. Mani Chandy
- Leslie Lamport
- Sam Toueg
- Lorenzo Alvisi
Related topics
Seminal works
- elnozahy2002
- koo1987
- chandy1985
Frequently asked questions
- What is the domino effect in rollback recovery?
- When processes checkpoint without coordination, rolling one back can force a dependent process to roll back too, which can cascade backward through the whole computation—potentially to the very beginning. Coordinated checkpointing or message logging is used to prevent it.
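The cascade described in this answer can be reproduced with a small rollback-propagation sketch in Python. The function `recovery_line` and the data encoding are assumptions made for illustration: each process numbers its local events from 1, a checkpoint is the number of events it includes, every process has an initial checkpoint 0, and a message is a tuple `(sender, send_index, receiver, recv_index)`.

```python
from bisect import bisect_left


def recovery_line(checkpoints, messages):
    """Roll receivers back until no orphan message remains.

    `checkpoints` maps each process to a sorted list of checkpoints,
    starting with 0 (the initial state).
    """
    line = {p: cps[-1] for p, cps in checkpoints.items()}  # start from the latest
    changed = True
    while changed:
        changed = False
        for sender, send_idx, receiver, recv_idx in messages:
            orphan = recv_idx <= line[receiver] and send_idx > line[sender]
            if orphan:
                cps = checkpoints[receiver]
                # Latest checkpoint taken strictly before the receive event.
                line[receiver] = cps[bisect_left(cps, recv_idx) - 1]
                changed = True
    return line


# P: event 1 sends m1, event 2 receives m2, then P checkpoints (2 events).
# Q: event 1 receives m1, event 2 is local, Q checkpoints (2 events),
#    then event 3 sends m2.
messages = [("P", 1, "Q", 1), ("Q", 3, "P", 2)]

uncoordinated = {"P": [0, 2], "Q": [0, 2]}
print(recovery_line(uncoordinated, messages))   # both fall back to 0

# If Q had instead checkpointed after sending m2, nothing would be orphaned.
aligned = {"P": [0, 2], "Q": [0, 3]}
print(recovery_line(aligned, messages))         # latest checkpoints survive
```

In the first case P's checkpoint records the receipt of m2, which Q sent after its own checkpoint, so P must roll back; that in turn orphans m1 at Q, and both processes end up at their initial state.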