
Research on Checkpoint-Based Rollback-Recovery Technology in Parallel Computing Environments

Posted on: 2011-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y Sun
Full Text: PDF
GTID: 2208330332977373
Subject: Software engineering
Abstract/Summary:
As parallel computing systems are applied in more fields, providing higher reliability for them becomes important. In particular, when a large-scale parallel program runs without the necessary fault tolerance mechanisms, a single abnormal event in one process or processor is likely to force the whole program to be restarted manually from the very beginning. Most current parallel computing environments do not provide such mechanisms, so a great deal of time is wasted when an abnormal event occurs.

Checkpoint-based rollback-recovery can effectively improve the fault tolerance of programs. However, setting checkpoints for parallel programs built on message passing systems is a major challenge, because message passing couples the processes to one another. Ensuring that a parallel program rolls back to a correct state when an abnormal event occurs, while also reducing the overall time cost of the rollback-recovery mechanism, is an important and difficult problem for large-scale parallel programs.

First, several classical checkpoint-based rollback-recovery protocols for parallel programs in message passing systems are studied. To reduce the time cost caused by coordination messages and process blocking when checkpoints are set, a non-blocking cooperative checkpoint protocol based on feasible global states is proposed. The protocol takes advantage of the fact that, when parallel programs run, checkpoint setting happens much more often than fault recovery. Therefore, we use non-blocking techniques to move most of the coordination and synchronization operations from the checkpoint setting stage to the rollback-recovery stage, which effectively reduces the overall time cost of the checkpoint mechanism.

Second, the Multi-Purpose Daemon (MPD), the process management component of the widely used parallel software development environment MPICH2, is studied, and fault detection and rollback-recovery functions are implemented for it. Using its own event management mechanism, the MPD periodically supervises the states of all processors and processes in order to detect abnormal events. When an abnormal event occurs, the program is restored to a correct state from the checkpoints most recently saved by the rollback-recovery protocol and resumes running from that state rather than from the very beginning of the program.

Finally, based on the rollback-recovery protocols and the fault detection mechanism studied in this thesis, we discuss how to provide fault tolerance for MPICH2. We also use the NAS Parallel Benchmarks to compare the proposed protocol with other protocols in the MPICH2 environment. The simulation results indicate that our protocol has the lowest time cost.
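
The reasoning behind moving coordination work from checkpoint setting to recovery can be illustrated with a rough cost model. The sketch below is only an illustration: the function name `total_overhead`, the per-checkpoint and per-recovery costs, and the event counts are all assumed values chosen for the example, not figures or code from the thesis.

```python
def total_overhead(num_checkpoints, num_recoveries, checkpoint_cost, recovery_cost):
    """Total time spent on fault tolerance: all checkpoints plus all recoveries.

    Hypothetical linear model; costs are in arbitrary time units.
    """
    return num_checkpoints * checkpoint_cost + num_recoveries * recovery_cost


# Assumed run: 100 checkpoints are taken, but only 2 failures occur.
# A blocking protocol pays for coordination at every checkpoint.
blocking = total_overhead(100, 2, checkpoint_cost=5.0, recovery_cost=10.0)

# A non-blocking protocol makes each checkpoint cheaper and defers the
# coordination work to recovery, which becomes more expensive.
non_blocking = total_overhead(100, 2, checkpoint_cost=2.0, recovery_cost=40.0)

print(blocking, non_blocking)
```

Under these assumed numbers the non-blocking variant spends less time overall even though each recovery costs four times as much, because checkpoints are far more frequent than failures. The advantage would shrink or reverse if failures became as common as checkpoints.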
Keywords/Search Tags: parallel systems, checkpoints, rollback-recovery protocols, process management, coordinated messages, non-blocking