Font Size: a A A

The Study And Analysis On Fault-Tolerant Parallel Algorithm

Posted on:2009-10-17Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y F DuFull Text:PDF
GTID:1118360278956590Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
As the size of large-scale computer systems increases, their mean-time-between-failures are becoming significantly shorter than the execution time of many current scientific applications. To complete the execution of scientific applications, they must tolerate hardware failures. Conventional rollback-recovery protocols redo the computation of the crashed process since the last checkpoint on a single processor. As a result, the recovery time of all protocols is no less than the time between the last checkpoint and the crash.In this thesis we propose a new application-level fault-tolerant approach for parallel applications called Fault-Tolerant Parallel Algorithm (FTPA) which provides fast self recovery. Our study focuses on the the theoretical foundation of FTPA, the concept of FTPA, the design methodology of FTPA and a tool supporting for degisning FTPA. We also analyze and evaluate the performance of FTPA. Primary innovative work in this thesis can be summarized as following:1. The definition for the reliability of parallel recomputing under failure is given. Also we present the quantitative analysis approach for the reliability of parallel recomputing based on task dependence graph. According to the analysis approach, we analyze and compare the effects of time redundancy and space redundancy on the reliability of parallel recomputing.2. To speed up the recovery procedure and improve the reliability of parallel computing, we propose a new application-level fault-tolerant approach called Fault-Tolerant Parallel Algorithm (FTPA). FTPA is a parallel algorithm which provides fast self-recovery. FTPA saves data at data-saving points for correct recovery during its execution. When a process fails, the failure will be detected by all surviving processes, which will re-execute the work lost on the failed process in parallel. Parallel recomputing speeds up the recovery procedure, making the recovery time considerably less than the time between the last checkpoint and the crash.3. The design of FTPA allows the manipulation of each program section into a fault-tolerant program section with the insertion of a data saving section, a failure detection section and a recovery section. In this paper, the design methodology of FTPA is discussed comprehensively. The program sections finally used by FTPA are produced by the approach for partitioning the original program or by the approach for splitting and combining the partitioned program sections. The data required to be saved in a data saving section is determined by the definition use relationship for parallel programs. For designing a recovery section, two methods are presented: automatic parallelization and the template approach. We also design and implement the FTPA version programs for three typical parallel applications which are matrix LU decomposition, Fast Fourier Transform and bucket sort.4. In order to ease FTPA implementation, we develop GiFT, a source-to-source precompiler tool to transform an MPI/Fortran or MPI/C program with user instrumented compiler directives into its FTPA version. GiFT requires that the user use compiler directives to choose program sections and figure out the variables necessary to be saved in a data-saving section through control-flow analysis and data-flow analysis. GiFT automates implementing recovery sections by means of compiler directives.5. To analyze and evaluate the performance of FTPA. Firstly, the performance metrics for parallel systems under failure are presented. We also give a performance model to predict the FTPA completion time under failure. Based on the performace model, we evaluate the effects of the program section size, the data-saving overhead and failure rate on the performace of FTPA. Secondly, we evaluate the performance of FTPA with parallel matrix multiplication and five kernels of NAS Parallel Benchmarks on a cluster system with 1024 CPUs. The experimental results show that the performance of FTPA is better than the performance of the traditional checkpointing approach.
Keywords/Search Tags:high-performance computing, fault-tolerance, reliability for parallel computing, fault-tolerant parallel algorithm, parallel recomputing, GiFT
PDF Full Text Request
Related items