Research And Implementation Of MPI Parallel Fault Tolerant Technology

Posted on: 2012-05-08   Degree: Master   Type: Thesis
Country: China   Candidate: H B Niu   Full Text: PDF
GTID: 2218330362960213   Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of HPC systems, their reliability is a growing concern. Fault tolerance is an important way to improve system reliability and deserves deeper investigation. Because MPI is widely used as a parallel programming environment, achieving fault tolerance in MPI systems has become an important research direction.

In this paper, we analyzed and compared the existing fault tolerance mechanisms and chose checkpointing to implement a fault tolerant MPI system. Specifically, we designed and implemented a portable and scalable MPI fault tolerant system, the Variable-based Fault Tolerant MPI System (VFTS), which is independent of current implementations of the MPI standard. Our work on VFTS is as follows.

First, this paper established a checkpointing performance model to guide users in adding fault tolerance to their applications and, based on this model, gave the requirements that fault tolerant programs in VFTS must meet to achieve the minimum time overhead. In addition, according to the characteristics of the system, we summarized performance optimization methods for fault tolerant applications and proposed the time, space, and communication constraint principles required to obtain better system performance.

Second, a method for dynamic communicator reconstruction was proposed. The static process model in the current MPI standard limits the fault tolerance ability of an MPI system. The proposed communicator reconstruction method can isolate and exclude the failed processes. New processes are then added to rebuild the invalid communicator dynamically, so that the program recovers its communicator and communication space after a failure occurs.

Third, this paper designed the partnership agreement used to store and recover program user data in VFTS. By having two or more processes save and restore user data for each other, the partnership agreement can recover the user data of failed processes. The agreement is simple in design and user-friendly, and users can easily adjust a program's fault tolerance capacity by adjusting the agreement.

Fourth, this paper designed a global consistency protocol to ensure the correctness of program states. This simple, low-cost protocol, supported by the data from the partnership agreement and the checkpointing mechanism, ensures that program system data and user data are stored and recovered consistently when a failure occurs.

Finally, VFTS was designed and implemented, and we used NPB to test and analyze its performance in detail, including time overhead, space overhead, communication load, fault tolerance ability, and system I/O.
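The idea of processes saving and restoring user data for each other can be illustrated with a small simulation. The sketch below is not the VFTS protocol itself, whose details the abstract does not give. It assumes a ring layout in which each rank's data is copied to its next k neighbours, and the names partners, checkpoint, and recover are hypothetical. Ranks are assumed to be numbered 0 to size - 1, and plain dictionaries stand in for process memory.

```python
from typing import Dict, List, Optional, Set


def partners(rank: int, size: int, k: int) -> List[int]:
    """Ranks holding a copy of `rank`'s user data (assumed ring layout)."""
    if not 0 < k < size:
        raise ValueError("k must satisfy 0 < k < size")
    return [(rank + i) % size for i in range(1, k + 1)]


def checkpoint(user_data: Dict[int, dict], k: int) -> Dict[int, Dict[int, dict]]:
    """Build stores[holder][owner]: each owner's data copied to its k partners."""
    size = len(user_data)
    stores: Dict[int, Dict[int, dict]] = {r: {} for r in user_data}
    for owner, data in user_data.items():
        for holder in partners(owner, size, k):
            stores[holder][owner] = dict(data)  # shallow copy of the owner's data
    return stores


def recover(
    failed: Set[int],
    stores: Dict[int, Dict[int, dict]],
    size: int,
    k: int,
) -> Dict[int, Optional[dict]]:
    """For each failed rank, fetch its data from the first surviving partner.

    A value of None means every partner of that rank failed as well.
    """
    recovered: Dict[int, Optional[dict]] = {}
    for owner in failed:
        recovered[owner] = None
        for holder in partners(owner, size, k):
            if holder not in failed:
                recovered[owner] = stores[holder][owner]
                break
    return recovered
```

In this sketch, k plays the role of the adjustable fault tolerance capacity: with four ranks and k = 1, losing ranks 2 and 3 together leaves rank 2 unrecoverable because its only partner is also gone, while k = 2 recovers both at the cost of storing twice as many copies.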
Keywords/Search Tags: MPI, fault tolerance, VFTS, performance model, dynamic communicator reconstruction, partnership agreement, global consistency protocol