
Study And Implementation Of Application-Level Checkpointing

Posted on: 2009-12-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P F Wang
Full Text: PDF
GTID: 1118360278956591
Subject: Computer Science and Technology
Abstract/Summary:
With the growth in the size of high-performance computer systems and the development of COTS manufacturing technology, high-performance computers face severe reliability challenges. Application-level checkpointing is a key technology for dealing with these challenges. At present, however, the potential performance advantage of application-level checkpointing cannot be fully exploited, and the technology is not easy to use.

The goal of this thesis is to develop an efficient and easy-to-use application-level checkpointing technology for high-performance computing. We focus on several important issues in application-level checkpointing: the optimization of data saving, the consistency of application-level checkpoints, the minimization of the total checkpoint size, and ease of use. The primary contributions of this thesis can be summarized as follows:

1. To address the problem that existing analysis methods cannot deal with MPI programs precisely, we defined the intra- and inter-process definition-use relationships in an MPI program and proposed Live-variable Analysis for MPI Programs (LAMP for short). LAMP overcomes two defects of conventional live-variable analysis methods: they cannot distinguish the different liveness of the same variable in different processes, and they cannot analyze the inter-process definition-use relationship in an MPI program. LAMP is the basic technology for optimizing the checkpoint size and checkpoint overhead.

2. We discussed the optimization of data saving for application-level checkpointing thoroughly. First, we analyzed the composition of an MPI program's computation state and chose to optimize the processes' computation state, which is the main part of the checkpoint data. Second, we proposed a new application-level checkpointing technology based on LAMP. Experimental results showed that the technology can efficiently decrease the checkpoint size and checkpoint overhead.

3. We studied the consistency issue of application-level checkpoints in depth. We proposed a new method that maintains the consistency of application-level checkpoints through compile-time analysis. The method does not need to log any early or late messages during checkpointing. It finds the safe checkpointing regions in an MPI program by static analysis. No message needs to be logged when a checkpoint is taken in a safe checkpointing region, and the recovery process is simple. Experimental results showed that the method is simple and efficient.

4. We discussed the optimization of the total overhead of multiple checkpoints and proposed an optimal placement of multiple checkpoints. Because most of the overhead of application-level checkpointing is the time spent writing checkpoint files, we first reduced the problem of minimizing the checkpoint overhead to the problem of minimizing the checkpoint size. We then abstracted the problem of optimal placement of multiple checkpoints into a mathematical model similar to the 0-1 integer programming model, and finally gave two algorithms for solving the model.

5. We addressed the ease-of-use issue of application-level checkpointing. We designed and implemented a source-to-source precompiler, ALEC, which can translate a Fortran77/MPI program into its fault-tolerant version with an efficient application-level checkpointing feature in a simple way.
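
To illustrate why liveness information reduces checkpoint size, the following is a minimal sketch of conventional, single-process backward live-variable analysis. It is not LAMP: it ignores per-process differences and MPI message edges, which are exactly the cases the thesis says LAMP adds. The toy control-flow graph, the variable names, and the function `live_variables` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Each node is (defined variables, used variables, successor node ids).
# This encoding of a control-flow graph is an assumption of the sketch.
Node = Tuple[Set[str], Set[str], List[int]]


def live_variables(cfg: Dict[int, Node]) -> Dict[int, Set[str]]:
    """Return the live-out set of every node using the classic
    backward data-flow equations, iterated to a fixed point:
        out[n] = union of in[s] for each successor s
        in[n]  = use[n] | (out[n] - def[n])
    """
    live_in: Dict[int, Set[str]] = {n: set() for n in cfg}
    live_out: Dict[int, Set[str]] = {n: set() for n in cfg}
    changed = True
    while changed:
        changed = False
        # Visiting nodes in reverse order converges faster for backward flow.
        for n in sorted(cfg, reverse=True):
            defs, uses, succs = cfg[n]
            out: Set[str] = set()
            for s in succs:
                out |= live_in[s]
            inn = uses | (out - defs)
            if out != live_out[n] or inn != live_in[n]:
                live_out[n], live_in[n] = out, inn
                changed = True
    return live_out


# Toy straight-line program (hypothetical):
#   0: a = read()
#   1: b = a * 2
#   2: tmp = b + 1
#   3: <checkpoint location>
#   4: c = b + a
#   5: print(c)
cfg: Dict[int, Node] = {
    0: ({"a"}, set(), [1]),
    1: ({"b"}, {"a"}, [2]),
    2: ({"tmp"}, {"b"}, [3]),
    3: (set(), set(), [4]),
    4: ({"c"}, {"a", "b"}, [5]),
    5: (set(), {"c"}, []),
}

# Only variables live after the checkpoint location need to be saved.
to_save = live_variables(cfg)[3]
```

In this toy program `tmp` is never read after the checkpoint location, so it is excluded from `to_save`, while `a` and `b` are kept because node 4 still reads them.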
Keywords/Search Tags: high-performance computing, fault-tolerance, application-level checkpointing, live-variable analysis for MPI programs, consistency issue, LAMP, ALEC