Checkpointing a multithreaded distributed shared memory computer system

Posted on: 2002-09-07
Degree: Ph.D.
Type: Dissertation
University: University of Kentucky
Candidate: Dieter, William Robert
GTID: 1468390011996806
Subject: Computer Science
Abstract/Summary:
Distributing a program over a cluster of commodity processors connected by a commodity network can speed up a computation at relatively low cost. Distributed cluster computing is especially useful for long-running scientific applications. As the number of processors and the running time of a program increase, however, so does the probability that one of the system's components will fail before the program ends. A program can prepare for failures by periodically saving its state in a checkpoint from which it can later be recovered.

Checkpointing distributed programs requires ensuring that the checkpoints saved by individual processes can be used together to restore a consistent state. Programs using a coordinated checkpointing algorithm communicate explicitly to save a consistent state. Programs using a communication-induced checkpointing algorithm build a consistent state without explicit communication. Although communication-induced checkpointing algorithms have less communication overhead, they do not add significantly less overhead to programs, because synchronization overhead is small compared to the time required to save a checkpoint to disk.

A checkpointing system builds consistent global checkpoints from checkpoints of individual processes. Each Unify process has multiple threads, but at the start of this research no checkpointing library existed that could checkpoint multithreaded programs. This research includes the development of a checkpointing library that checkpoints multithreaded processes on Solaris 2.5 and Linux. The library can be used by Unify or as a standalone checkpointing library for multithreaded processes.
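
The idea of coordinated checkpointing can be illustrated with a small sketch. This is not the dissertation's library or Unify's algorithm; it is a minimal Python illustration that assumes threads within one process stand in for the cooperating processes, and every name in it (`run`, `save_checkpoint`, `load_checkpoint`, `NUM_WORKERS`, `CHECKPOINT_EVERY`) is hypothetical. All workers stop at a barrier, one of them writes the combined state to disk, and no worker resumes until the write has finished, so the saved state is consistent.

```python
import os
import pickle
import tempfile
import threading

NUM_WORKERS = 3        # hypothetical number of cooperating workers
STEPS = 10             # total steps each worker performs
CHECKPOINT_EVERY = 4   # steps between coordinated checkpoints


def save_checkpoint(path, state):
    """Write the state atomically: temporary file first, then rename."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Read back the most recently saved state."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, start_state=None, fail_at=None):
    """Run all workers from start_state.

    If fail_at is given, every worker stops after that step, which
    mimics a crash that loses all work since the last checkpoint.
    """
    state = start_state or {"step": 0, "totals": [0] * NUM_WORKERS}
    totals = list(state["totals"])
    first = state["step"]
    last = STEPS if fail_at is None else fail_at
    barrier = threading.Barrier(NUM_WORKERS)

    def worker(i):
        for step in range(first + 1, last + 1):
            totals[i] += (i + 1) * step  # stand-in for real computation
            if step % CHECKPOINT_EVERY == 0:
                # Phase 1: every worker has finished this step and stops.
                # Barrier.wait() returns a distinct index per thread, so
                # exactly one worker (index 0) saves the global state.
                if barrier.wait() == 0:
                    save_checkpoint(
                        path, {"step": step, "totals": list(totals)}
                    )
                # Phase 2: nobody resumes until the checkpoint is on disk.
                barrier.wait()

    threads = [
        threading.Thread(target=worker, args=(i,))
        for i in range(NUM_WORKERS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals
```

To exercise the sketch, call `run(path, fail_at=6)` to simulate a failure after step 6, then resume with `run(path, start_state=load_checkpoint(path))`. The resumed run restarts from the checkpoint taken at step 4 and should produce the same totals as an uninterrupted `run(path)`.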
Keywords/Search Tags:Checkpointing, Multithreaded, Distributed, Processes, Program