Distributed system fault tolerance using message logging and checkpointing
Johnson, David Bruce
Doctor of Philosophy
Fault tolerance can allow processes executing in a computer system to survive failures within the system. This thesis addresses the theory and practice of transparent fault-tolerance methods using message logging and checkpointing in distributed systems. A general model for reasoning about the behavior and correctness of these methods is developed, and the design, implementation, and performance of two new low-overhead methods based on this model are presented. No specialized hardware is required with these new methods. The model is independent of the protocols used in the system. Each process state is represented by a dependency vector, and each system state is represented by a dependency matrix showing a collection of process states. The set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. There is thus always a unique maximum recoverable system state. The first method presented uses a new pessimistic message logging protocol called sender-based message logging. Each message is logged in the local volatile memory of the machine from which it was sent, and the order in which the message was received is returned to the sender as a receive sequence number. Message logging overlaps execution of the receiver, until the receiver attempts to send a new message. Implemented in the V-System, the maximum measured failure-free overhead on distributed application programs was under 16 percent, and average overhead measured 2 percent or less, depending on problem size and communication intensity. Optimistic message logging can outperform pessimistic logging, since message logging occurs asynchronously. A new optimistic message logging system is presented that guarantees to find the maximum possible recoverable system state, which is not ensured by previous optimistic methods. All logged messages and checkpoints are utilized, and thus some messages received by a process before it was checkpointed may not need to be logged. Although failure recovery using optimistic message logging is more difficult, failure-free application overhead using this method ranged form only a maximum of under 4 percent to much less than 1 percent.