Research On Checkpoint-Based Rollback-Recovery Technology In Parallel Computing Environments

Posted on:2011-04-13

Degree:Master

Type:Thesis

Country:China

Candidate:Y Sun

Full Text:PDF

GTID:2208330332977373

Subject:Software engineering

Abstract/Summary:

As the application fields of parallel computing systems expand, it becomes increasingly important to provide higher reliability for these systems. In particular, when a large-scale parallel program runs without the necessary fault tolerance mechanisms, even a single abnormal event in one process or processor is likely to force the whole program to be restarted manually from the very beginning. However, most current parallel computing environments do not provide fault tolerance mechanisms, so a great deal of time is wasted when an abnormal event occurs.

Checkpoint-based rollback-recovery techniques can effectively improve the fault tolerance of programs. However, setting checkpoints for parallel programs built on message passing systems is a major challenge, because message passing couples the processes to one another. How to ensure that a parallel program rolls back to a correct state when an abnormal event occurs, while also reducing the overall time cost of the rollback-recovery mechanism, is an important and difficult problem for large-scale parallel programs.

Firstly, several classical checkpoint-based rollback-recovery protocols for parallel programs in message passing systems are studied. To reduce the time cost caused by coordination messages and process blocking when setting checkpoints, a non-blocking cooperative checkpoint protocol based on feasible global states is proposed. The proposed protocol takes advantage of the fact that, when running parallel programs, checkpoint setting happens much more often than fault recovery. Therefore, we try to transfer most of the coordination and synchronization operations from the checkpoint setting stage to the rollback-recovery stage through non-blocking techniques, which effectively reduces the overall time cost of the checkpoint mechanism.

Secondly, the Multi-Purpose Daemon (MPD), the process management component of the commonly used parallel software development environment MPICH2, is studied, and fault detection and rollback-recovery functions are implemented for it. Using its own event management mechanism, MPD periodically monitors the states of all processors and processes in order to detect abnormal events. When an abnormal event occurs, the program can be restored to a correct state from the checkpoints most recently saved by the rollback-recovery protocol, and it resumes running from that state rather than from the very beginning of the program.

Finally, based on the rollback-recovery protocols and fault detection mechanism studied in this paper, we discuss how to provide fault tolerance for MPICH2. Furthermore, we use the NAS Parallel Benchmarks to compare the proposed protocol with other protocols in the MPICH2 environment. The simulation results indicate that our protocol has the lowest time cost.
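
The abstract does not give the details of the proposed protocol, so the following is only a generic illustration of what it can mean to defer coordination to the recovery stage. It is a minimal sketch of classic rollback propagation: processes save local checkpoints without blocking, and a consistent set of checkpoints (a recovery line) is computed only when a fault occurs. The names `Checkpoint` and `find_recovery_line`, and the per-peer message counters, are assumptions made for this example and do not come from the thesis or from MPICH2.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Checkpoint:
    """A local checkpoint plus message counters at the moment it was saved.

    Hypothetical structure: sent[p] / recv[p] count the messages exchanged
    with peer rank p up to this checkpoint.
    """
    sent: Dict[int, int] = field(default_factory=dict)
    recv: Dict[int, int] = field(default_factory=dict)


def find_recovery_line(checkpoints: Dict[int, List[Checkpoint]]) -> Dict[int, int]:
    """Pick one checkpoint index per rank so that no orphan message exists.

    An orphan message is one that the receiver's checkpoint has recorded as
    received while the sender's checkpoint has not recorded it as sent.
    Each rank's list is ordered oldest first; index 0 is assumed to be the
    initial state with all counters at zero.
    """
    # Start optimistically from the most recent checkpoint of every rank.
    line = {rank: len(cps) - 1 for rank, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for receiver in checkpoints:
            for sender in checkpoints:
                if sender == receiver:
                    continue
                sent = checkpoints[sender][line[sender]].sent.get(receiver, 0)
                received = checkpoints[receiver][line[receiver]].recv.get(sender, 0)
                # Roll the receiver back until it no longer depends on
                # messages the sender's checkpoint has not yet sent.
                while received > sent:
                    if line[receiver] == 0:
                        raise RuntimeError("initial checkpoint is inconsistent")
                    line[receiver] -= 1
                    received = checkpoints[receiver][line[receiver]].recv.get(sender, 0)
                    changed = True
    return line


# Rank 1's latest checkpoint has received 2 messages from rank 0, but
# rank 0's latest checkpoint has only sent 1, so rank 1 must roll back.
history = {
    0: [Checkpoint(), Checkpoint(sent={1: 1})],
    1: [Checkpoint(), Checkpoint(recv={0: 2})],
}
print(find_recovery_line(history))  # prints {0: 1, 1: 0}
```

In this sketch the checkpoint phase needs no coordination messages at all, and the whole cost of establishing consistency is paid in `find_recovery_line`, which runs only after a fault. The trade-off is that recovery may discard more work than a fully coordinated scheme would.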

Keywords/Search Tags:

parallel systems, check-points, rollback-recovery protocols, process management, coordinated messages, non-blocking

Related items

1	Research On Key Technology Of Coordinated Rollback-recovery Protocols In Cloud Platform
2	Dynamic Cluster Strategy For Hierarchical Rollback-Recovery Protocols
3	Research Of Rollback Recovery Based On Dependency Tracking And Message Counting
4	Study On Backward Recovery Of Fault Tolerant Technology In Distributed Systems
5	Research On Rollback Recovery Technology In Distributed Systems
6	Research On Non-Blocking Coordinated Checkpointing Algorithm In The Mobile Computing Environment
7	Research On Message Log Recovery Algorithm Based On Message Reordering And Message Number Check
8	Cluster Oriented Fault Tolerance For MPI Parallel Applications
9	Fault-Tolerant Of MPI Programs Based On Rollback Recovery
10	Research On Incremental Checkpointing And Rollback Recovery