Research On Fault-Tolerance Technology For Message-Passing System

Posted on:2007-11-25

Degree:Master

Type:Thesis

Country:China

Candidate:G W Wan

Full Text:PDF

GTID:2178360215470250

Subject:Computer Science and Technology

Abstract/Summary:

In recent years, cluster systems have become popular environments for executing parallel applications because of their extensibility and high performance-to-price ratio. As their application fields and scale expand, and as Grid computing develops, clusters are required to provide higher reliability. In long-running, large-scale message-passing parallel computations, a single abnormal event is likely to cause the entire application to fail, and current message-passing systems, such as the Message Passing Interface (MPI), do not provide fault tolerance. To avoid this waste of time, it is necessary to research fault-tolerance technology for message-passing systems so that clusters can achieve high availability.

Checkpointing and Rollback Recovery (CRR) is a method that avoids wasting the computation completed before a failure occurs. However, checkpointing a parallel application in a message-passing system is more complex than having each processor take checkpoints independently, because messages induce inter-process dependencies during failure-free operation. Obtaining a globally consistent recovery state is a difficult problem in CRR protocols. Furthermore, node and process failures may cause the parallel application to fail, and process failures sometimes cause the application to hang. Fault detection and automatic recovery are therefore important parts of parallel fault tolerance.

First, this thesis studies rollback-recovery protocols comprehensively and analyzes and compares many existing coordinated checkpoint protocols. The analysis shows that blocking and synchronization messages are the two main factors affecting the overhead of coordinated checkpoint protocols.
Based on the current state and existing problems of coordinated checkpoint protocols, the thesis presents the concept of a reconstructable checkpoint and a non-blocking coordinated checkpoint protocol built on it. The protocol classifies process state into three kinds, uses message piggybacking and non-blocking operation, and reduces the number of synchronization messages. It shifts most of the checkpoint overhead to the moment of rollback recovery. Because checkpoints are taken much more often than rollbacks occur, this reduces the overall overhead of the checkpoint mechanism.

Second, the process manager MPD is analyzed and studied comprehensively. Adding fault detection and automatic recovery to MPD overcomes the problems of manual restart and hanging applications. The enhanced MPD, called MPD/FT, monitors nodes and processes and recovers automatically when it detects a node or process failure.

Finally, three coordinated checkpoint protocols are implemented in MPICH2, and their overheads are measured experimentally. The experimental results show that the overhead of the non-blocking coordinated checkpoint protocol based on reconstructable checkpoints is lower than that of the other protocols.
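
To illustrate why a globally consistent recovery state is hard to obtain, the following minimal Python sketch checks a set of per-process checkpoints for orphan messages, that is, messages recorded as received by one process although the sender's checkpoint does not yet record them as sent. It is not taken from the thesis: the `Message` record, the local event counters, and the function names are hypothetical, and the check covers only the standard no-orphan condition, not the reconstructable-checkpoint protocol itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """One application message (hypothetical record for illustration)."""
    sender: int      # id of the sending process
    receiver: int    # id of the receiving process
    send_time: int   # sender's local event counter at the send
    recv_time: int   # receiver's local event counter at the delivery


def orphan_messages(checkpoints, messages):
    """Return the messages that make the checkpoint set inconsistent.

    `checkpoints` maps a process id to the local event counter at which
    that process took its checkpoint; the checkpoint records every local
    event whose counter is less than or equal to that value.
    A message is an orphan when the receiver's checkpoint records its
    delivery but the sender's checkpoint does not record its send.
    """
    return [
        m for m in messages
        if m.recv_time <= checkpoints[m.receiver]
        and m.send_time > checkpoints[m.sender]
    ]


def is_consistent(checkpoints, messages):
    """A checkpoint set is consistent when it contains no orphan message."""
    return not orphan_messages(checkpoints, messages)


if __name__ == "__main__":
    # Process 0 sends at its local event 5; process 1 receives at its event 3.
    msg = Message(sender=0, receiver=1, send_time=5, recv_time=3)
    # Process 0 checkpoints before the send, process 1 after the receive.
    print(is_consistent({0: 4, 1: 4}, [msg]))
    # Process 0 checkpoints after the send instead.
    print(is_consistent({0: 5, 1: 4}, [msg]))
```

Messages that were sent before the sender's checkpoint but not yet received at the receiver's checkpoint (in-transit messages) do not violate this condition; a recovery protocol has to handle them separately, for example by logging them.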

Keywords/Search Tags:

Fault-Tolerance, MPI, MPICH2, Checkpointing, Rollback, Fault Detection, Automatic Recovery

Related items

1	Achieving Fault-Tolerance And High-Performance In Grid Applications
2	Incorporating File I/O Into Checkpointing Under Clusters Environment
3	Research On Fault-Tolerant Checkpointing Algorithm And In Software Design
4	Research On Incremental Checkpointing And Rollback Recovery
5	Research On Recovery-Oriented Fault-Tolerant Computing Technique
6	Study On Fault-Tolerance Mechanism And Realization In Real-Time Distributed Computer Systems
7	Research On Checkpointing And Rollback Recovery Fault-tolerant Techniques For Mobile Computing Environment
8	Software Implemented Checkpointing Fault Tolerance In On-board Computer
9	Study And Implementation Of Fault Tolerance For Heterogeneous Parallel Computer
10	Optimization Techniques Of Proactive Fault Tolerance For Large-scale High Performance Computing Systems