Study And Implementation Of Fault Tolerance For Heterogeneous Parallel Computer

Posted on:2012-04-05

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J Jia

Full Text:PDF

GTID:1118330362460340

Subject:Computer Science and Technology

Abstract/Summary:

Parallel computing is a core technology for ultra-high-performance computing. As the performance of GPGPUs (general-purpose computing on graphics processing units) keeps improving, heterogeneous parallel systems built from CPUs and GPUs have become an active research area in high-performance computing. However, as parallel systems grow in scale, high-performance computers face serious challenges. Because heterogeneous parallel systems have a more complicated architecture and distinctive features, and because GPGPUs offer weak fault tolerance, large-scale CPU-GPU heterogeneous parallel systems suffer from an acute reliability problem for which practical solutions are lacking.

This paper studies fault-tolerance techniques for heterogeneous parallel systems. Based on how hardware errors propagate through software in such systems, it optimizes the checkpoint size of application-level checkpointing, optimizes the global overhead of multiple checkpoints and proposes a configuration scheme, and explores a GPU-oriented multi-copy fault-tolerance technique (RB-TMR). The main contributions of the paper are summarized as follows:

1. An acceptance model for general-purpose computers is proposed. Acceptance and the degree of acceptance are first defined for program results and for multiple executions of a program. Based on these definitions, theorems and corollaries on the degree of acceptance are derived. The paper then extends these theorems and corollaries to heterogeneous parallel systems and establishes an acceptance model for such systems. Case studies analyze how two common fault-tolerance techniques, checkpoint/restart and TMR (triple modular redundancy), affect the acceptance model when they are applied to heterogeneous parallel systems. From this analysis, constructive suggestions and optimization methods for fault-tolerance mechanisms are obtained.

2. Based on the theory of inter-procedural dependence, an error propagation model is proposed. It describes how hardware errors propagate through software in CPU-GPU heterogeneous parallel systems. Using this model, the system's checkpointing mechanism is designed and the checkpoint size is optimized. Experimental results show that this method effectively reduces overhead and improves fault-tolerance performance.

3. To minimize the global overhead of multiple checkpoints, this paper proposes placement optimization methods for both synchronous and asynchronous mechanisms in heterogeneous parallel systems. First, two essential problems in optimizing the placement of multiple checkpoints are identified. Second, based on an analysis of architecture and program features, two checkpoint placement methods for heterogeneous parallel systems are proposed: synchronous checkpoint placement and asynchronous checkpoint placement. Finally, for the two problems, both methods are analyzed and modeled, and their solution algorithms are given.

4. A fault-tolerance technique, RB-TMR, that combines a rollback mechanism with TMR is proposed. It provides fault tolerance for both fail-stop faults and transient faults. The technique is implemented according to the architecture and program features of heterogeneous parallel systems. In addition, a source-to-source compilation assistant tool is designed for RB-TMR; it helps users apply the technique to CUDA programs and reduces their burden. Experimental results show that RB-TMR achieves a high error detection and correction rate while decreasing the probability of rollback. It is concluded that RB-TMR delivers better fault-tolerance performance than conventional checkpointing and TMR.
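
The idea of combining rollback with TMR can be illustrated with a small sketch. This is not the dissertation's RB-TMR implementation, which targets CUDA programs and is not detailed in the abstract; it is a minimal CPU-side Python illustration under stated assumptions. The names `majority_vote`, `rb_tmr_step`, and `make_faulty_kernel` are hypothetical, the state is assumed to be an immutable tuple so that it can serve as its own checkpoint, and a fail-stop fault is modeled as a replica raising `RuntimeError`.

```python
from collections import Counter


def majority_vote(results):
    """Return the value that at least two replicas agree on, or None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None


def rb_tmr_step(kernel, state, max_rollbacks=3):
    """Run `kernel` on three replicas of `state`, vote, and roll back on disagreement.

    Returns (new_state, rollbacks_used). `state` must be immutable and
    hashable (assumption of this sketch), so keeping a reference to it
    acts as the checkpoint.
    """
    checkpoint = state
    for rollbacks in range(max_rollbacks + 1):
        results = []
        for replica in range(3):
            try:
                results.append(kernel(checkpoint, replica))
            except RuntimeError:
                pass  # fail-stop replica: it contributes no result to the vote
        voted = majority_vote(results)
        if voted is not None:
            return voted, rollbacks  # a single faulty replica is masked by the vote
        # No majority: discard all results and re-execute from the checkpoint.
    raise RuntimeError("no majority after the allowed number of rollbacks")


def make_faulty_kernel(faults):
    """Hypothetical fault injector: `faults` is a set of (round, replica) pairs to corrupt."""
    calls = {"n": 0}

    def kernel(state, replica):
        round_index = calls["n"] // 3  # three replica calls per round
        calls["n"] += 1
        out = tuple(x * 2 for x in state)
        if (round_index, replica) in faults:
            # Replica-dependent corruption, so two faulty replicas disagree.
            out = (out[0] + 1 + replica,) + out[1:]
        return out

    return kernel


# One corrupted replica: the vote masks it and no rollback is needed.
print(rb_tmr_step(make_faulty_kernel({(0, 1)}), (1, 2, 3)))
# Two corrupted replicas in the first round: one rollback, then agreement.
print(rb_tmr_step(make_faulty_kernel({(0, 0), (0, 1)}), (1, 2, 3)))
```

In this sketch, voting handles the common case of a single transient fault without re-execution, and rollback is used only when the replicas fail to reach a majority, which is one plausible reading of why such a combination can lower the probability of rollback compared with checkpointing alone.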

Keywords/Search Tags:

High-performance computing, heterogeneous parallel systems, fault-tolerance, acceptance, application-level checkpointing, inter-procedural dependence, fault propagation behavior

Related items

1	Optimization Techniques Of Proactive Fault Tolerance For Large-scale High Performance Computing Systems
2	Study And Implementation Of Application-Level Checkpointing
3	Fault-Tolerance Techniques Research For The Parallel CFD Application Software Framework
4	Achieving Fault-Tolerance And High-Performance In Grid Applications
5	Analysis Of Hardware Fault Propagation In Programs And Research On Fault-tolerance Techniques
6	The Study And Analysis On Fault-Tolerant Parallel Algorithm
7	Fault Tolerance For Distributed Parallel Stream Processing Systems
8	Optimization And Design Of High Reliability Parallel Heterogeneous Multi-Core Systems
9	Research On Fault-Tolerance Technology For Message-Passing System
10	Research On Fault Tolerance Of High-performance Computing With NVRAM