Study And Implementation Of Fault Tolerance For Heterogeneous Parallel Computer

Posted on: 2012-04-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Jia
Full Text: PDF
GTID: 1118330362460340
Subject: Computer Science and Technology
Abstract/Summary:
Parallel computing is a key technology for ultra-high-performance computing. As the performance of GPGPU (General-Purpose computation on Graphics Processing Units) keeps improving, heterogeneous parallel systems built from CPUs and GPUs have become an active research area in high-performance computing. However, as parallel systems grow in size, high-performance computers encounter serious challenges. Because of the more complicated architecture and unique features of heterogeneous parallel systems, together with the weak fault tolerance of GPGPU, large-scale CPU-GPU heterogeneous parallel systems face an acute reliability problem for which practical solutions are lacking.

This paper studies fault-tolerance techniques for heterogeneous parallel systems. Based on how hardware errors propagate through software in such systems, this paper optimizes the checkpoint size of application-level checkpointing, optimizes the global overhead of multiple checkpoints and proposes a configuration solution, and explores a GPU-oriented multi-copy fault-tolerance technique (RB-TMR). The main contributions of the paper are summarized as follows:

1. An acceptance model for general-purpose computers is proposed. Acceptance and the degree of acceptance are first defined for program results and for repeated program executions. Based on these definitions, theorems and corollaries regarding the acceptance degree are derived. This paper then extends the theorems and corollaries to heterogeneous parallel systems and establishes an acceptance model for them. Case studies analyze how two common fault-tolerance techniques, checkpoint/restart and TMR, affect the acceptance model when applied in heterogeneous parallel systems. From this analysis, constructive suggestions and optimization methods for fault-tolerance mechanisms are obtained.

2. Based on the theory of inter-procedural dependence, an error propagation model is proposed. It describes the propagation behavior of hardware errors in software in CPU-GPU heterogeneous parallel systems. Using this model, the system's checkpointing mechanism is designed and the checkpoint size is optimized. Experimental results show that this method effectively reduces checkpointing overhead and improves fault-tolerance performance.

3. To minimize the global overhead of multiple checkpoints, this paper proposes a placement optimization method for both synchronous and asynchronous mechanisms in heterogeneous parallel systems. First, two essential issues in optimizing the placement of multiple checkpoints are identified. Second, based on an analysis of architecture and program features, two methods of checkpoint placement in heterogeneous parallel systems are proposed: synchronous checkpoint placement and asynchronous checkpoint placement. Finally, for the two issues, both methods are analyzed and modeled, and solution algorithms are given.

4. A fault-tolerance technique (RB-TMR) that combines a rollback mechanism with the TMR technique is proposed. It provides fault tolerance for both fail-stop faults and transient faults. The technique is implemented according to the architecture and program features of heterogeneous parallel systems. In addition, a source-to-source compilation assistant tool is designed for RB-TMR. The tool helps users apply the RB-TMR technique to CUDA programs and reduces their implementation burden. Experimental results show that RB-TMR achieves a high error detection and correction rate and reduces the probability of rollback. It is concluded that RB-TMR delivers better fault-tolerance performance than conventional checkpointing and conventional TMR.
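The general idea of combining rollback with triple modular redundancy, as described in the fourth contribution, can be illustrated with a small host-side sketch. This is not the dissertation's RB-TMR implementation, which targets CUDA programs and is not detailed in the abstract. It is a minimal Python illustration under stated assumptions: a computation step is run on three replicas, a majority vote masks a single transient error, a replica that raises an exception is treated as a fail-stop fault, and the step is re-executed from a saved input (the "checkpoint") only when no two replicas agree. All names here (`majority_vote`, `run_with_rollback_tmr`, `flaky_sum`, `noisy_sum`) are hypothetical.

```python
from collections import Counter


def majority_vote(results):
    """Return the value at least two replicas agree on, or None if there is none."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None


def run_with_rollback_tmr(step, checkpoint, max_rollbacks=3):
    """Run `step` on three replicas; vote; roll back to `checkpoint` on disagreement.

    `step(checkpoint, replica, attempt)` is a hypothetical signature: the extra
    arguments exist only so the demo functions below can inject faults.
    Returns (voted_result, number_of_rollbacks).
    """
    for attempt in range(max_rollbacks + 1):
        results = []
        for replica in range(3):
            try:
                results.append(step(checkpoint, replica, attempt))
            except RuntimeError:
                # Assumption: a fail-stop fault surfaces as an exception,
                # so the failed replica simply contributes no result.
                continue
        voted = majority_vote(results)
        if voted is not None:
            return voted, attempt
        # No two replicas agree: discard the results and re-execute
        # from the unchanged checkpoint (the rollback).
    raise RuntimeError("no majority after the allowed number of rollbacks")


def flaky_sum(data, replica, attempt):
    """Demo step: replica 1 returns a corrupted value on the first attempt."""
    total = sum(data)
    return total + 1 if (attempt == 0 and replica == 1) else total


def noisy_sum(data, replica, attempt):
    """Demo step: all replicas disagree on the first attempt, then recover."""
    total = sum(data)
    return total + replica if attempt == 0 else total


if __name__ == "__main__":
    print(run_with_rollback_tmr(flaky_sum, (1, 2, 3)))
    print(run_with_rollback_tmr(noisy_sum, (1, 2, 3)))
```

In this sketch a single corrupted replica is corrected by the vote without any rollback, while a three-way disagreement forces one re-execution. That mirrors, in a very simplified form, why adding voting to rollback can lower the rollback probability compared with checkpointing alone.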
Keywords/Search Tags:High-performance computing, heterogeneous parallel systems, fault-tolerance, acceptance, application-level checkpointing, inter-procedural dependence, fault propagation behavior