Fault Tolerance of MPI Programs Based on Rollback Recovery

Posted on: 2006-12-14
Degree: Master
Type: Thesis
Country: China
Candidate: J H Zhou
Full Text: PDF
GTID: 2168360155962556
Subject: Software engineering
Abstract/Summary:
A process checkpoint mechanism periodically saves the state of a running process and uses the saved state to recover after a failure, so that the computation resumes from the most recent checkpoint rather than from the beginning. This reduces lost computation and improves the reliability and availability of programs. In a message-passing system, however, parallel processes exchange messages during failure-free execution, which creates dependencies among them; checkpointing must therefore ensure a globally consistent state for rollback recovery to be correct.

This thesis first briefly introduces fault tolerance in message-passing systems and reviews current checkpointing and rollback-recovery techniques. Based on a study of the user process space in Windows, the dominant operating system in personal computing, it proposes that saving and recovering process state be split into two parts: kernel objects and the user address space. For the user address space, consistent recovery is achieved with a two-step protocol that first restores the layout and then restores the contents. For kernel objects, a virtual object mapping table, maintained by intercepting and wrapping kernel API calls, provides consistent recovery of those objects. Together these ensure consistent recovery of the user process space.

Next, based on a study of the message passing interface mechanism, the thesis proposes the idea of a slice for processes that communicate frequently on an MPI network topology. By combining the original MPI library functions and adding related code, it implements slice library functions on a general network topology and analyzes their performance. Slices make parallel programs more convenient and efficient to write.

Following that, the thesis improves the fast N+1 parity checkpoint algorithm.
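As background for the N+1 parity scheme named above, the following minimal Python sketch shows the generic idea: one extra machine stores the bytewise XOR of the N application checkpoints, so any single lost checkpoint can be rebuilt from the survivors. It illustrates the general technique only, not the thesis's improved algorithm; the function names (`make_parity`, `recover`) and the equal-length checkpoint images are assumptions made for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all checkpoints must have the same length")
    # zip(*blocks) yields one tuple of ints per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_parity(checkpoints):
    """Parity image kept on the extra (N+1-th) machine (hypothetical helper)."""
    return xor_blocks(checkpoints)


def recover(surviving, parity):
    """Rebuild the single missing checkpoint from the survivors plus parity."""
    return xor_blocks(list(surviving) + [parity])


# Three application processes with checkpoint images padded to equal size.
checkpoints = [b"rank0-state!", b"rank1-state!", b"rank2-state!"]
parity = make_parity(checkpoints)

# Simulate losing rank 1 and rebuilding its checkpoint.
rebuilt = recover([checkpoints[0], checkpoints[2]], parity)
assert rebuilt == checkpoints[1]
```

Because XOR is its own inverse, the same routine serves for both encoding and recovery; the cost is that only one lost image can be rebuilt per parity image.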
The improved algorithm eliminates one backup machine by using incremental disk checkpoints on the parity-computing checkpointing machine, tolerates simultaneous failure of the checkpointing machine and one application machine through a two-step commit protocol, and uses caching to reduce its runtime overhead. Experiments in an MPI parallel environment show that the improved algorithm achieves better fault-tolerance performance.

Finally, the thesis proposes an agent-based N+1 parity rollback-recovery protocol for the case where many parallel processes may reside on the same host in MPI parallel...
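The incremental disk checkpoint mentioned in the abstract can be illustrated with a minimal Python sketch in which only the blocks that changed since the previous checkpoint are written. The abstract does not say how the thesis detects changed data; comparing hashes of fixed-size blocks is an assumption here, as are all names (`incremental_checkpoint`, `apply_increment`, `BLOCK_SIZE`).

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow


def split_blocks(image, block_size=BLOCK_SIZE):
    """Cut a state image into fixed-size blocks."""
    return [image[i:i + block_size] for i in range(0, len(image), block_size)]


def incremental_checkpoint(image, previous_digests):
    """Return (changed_blocks, digests); changed_blocks maps index -> bytes.

    Assumes the image does not shrink between checkpoints.
    """
    changed = {}
    digests = []
    for index, block in enumerate(split_blocks(image)):
        digest = hashlib.sha256(block).digest()
        digests.append(digest)
        if index >= len(previous_digests) or previous_digests[index] != digest:
            changed[index] = block
    return changed, digests


def apply_increment(stored_blocks, changed):
    """Merge the changed blocks into the copy held on disk."""
    merged = dict(stored_blocks)
    merged.update(changed)
    return merged


# First checkpoint: every block is new, so everything is written.
state_v1 = b"AAAABBBBCCCC"
full, digests = incremental_checkpoint(state_v1, [])
on_disk = apply_increment({}, full)

# Second checkpoint: only the middle block changed.
state_v2 = b"AAAAXXXXCCCC"
delta, digests = incremental_checkpoint(state_v2, digests)
on_disk = apply_increment(on_disk, delta)

restored = b"".join(on_disk[i] for i in sorted(on_disk))
assert list(delta) == [1]
assert restored == state_v2
```

Writing only the changed blocks trades some hashing work per checkpoint for less disk traffic.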
Keywords/Search Tags: software fault tolerance, checkpointing and rollback recovery, message passing interface, checkpoint overhead