Analysis Of Hardware Fault Propagation In Programs And Research On Fault-tolerance Techniques

Posted on: 2013-07-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X H Xu
Full Text: PDF
GTID: 1268330392973859
Subject: Computer Science and Technology
Abstract/Summary:
With advances in manufacturing technology, growth in system scale, and the rise of heterogeneous systems, the performance of high performance computers keeps improving while the reliability problem of supercomputers becomes more and more serious. Reliability has become one of the factors limiting the development of high performance computing (HPC). Although the reliability of supercomputers can be raised by improving the reliability of hardware components or by using redundant hardware components, the cost of such fault tolerance methods is too high. Software implemented hardware fault tolerance (SIHFT) methods provide hardware fault tolerance by modifying programs, without any modification to the hardware.

Hardware faults and errors propagate during program execution. Analyzing hardware fault propagation in programs can help researchers implement hardware fault tolerance in software more effectively. Accordingly, the research in this thesis is divided into a base part and an application part: the base part analyzes the behavior of hardware fault propagation in programs, and the application part designs optimized fault tolerance methods based on the analysis results of the base part.

In the base part, this thesis chooses three typical kinds of programs (serial programs, homogeneous parallel programs, and heterogeneous parallel programs) and investigates the behavior of hardware fault propagation in them. The main work and contributions of this part are as follows:

1. Building the model of hardware fault propagation in serial programs (Chapter 2)
Serial programs are a basic kind of program, and the analysis of hardware fault propagation in serial programs is the foundation for research on hardware fault propagation in programs generally. This thesis classifies the errors that occur during propagation into three types: original errors, data-flow subsequential errors, and control-flow subsequential errors. Using forward data flow analysis over a program's detailed control flow graph, we give the error propagation equations and the corresponding algorithms for data-flow and control-flow subsequential errors in serial programs. Together these form the model of hardware fault propagation in serial programs. Given the original errors, researchers can use the model to compute the error information at every point in a serial program.

2. Building the model of hardware fault propagation in homogeneous parallel programs, taking MPI programs as an example (Chapter 3)
MPI programs, a typical kind of homogeneous parallel program, are currently the de facto standard in parallel and distributed computing. According to the characteristics of MPI programs, this thesis classifies data-flow subsequential errors into two subtypes: intra-process errors and inter-process errors. Taking variables and variable copies in concrete processes as carriers, we analyze the propagation of inter-process errors and derive the error propagation equations and the corresponding algorithms for data-flow subsequential errors in MPI programs. Together these form the model of hardware fault propagation in MPI programs. Given the original errors, researchers can use the model to compute the error information at every point in an MPI program, at the granularity of variables or variable copies.

3. Building the model of hardware fault propagation in heterogeneous parallel programs, taking GPGPU programs as an example (Chapter 4)
CPU-GPU heterogeneous systems are widely used in high performance computing, and GPGPU programs are now a typical kind of heterogeneous parallel program. According to the characteristics of GPGPU programs, this thesis analyzes the errors caused by hardware faults and divides them into CPU errors and GPU errors. Because some statements in GPGPU programs can execute asynchronously, we analyze the uncertainty of the errors at a given point in a GPGPU program and design equations and algorithms that compute those errors conservatively. We also propose an accelerated method for analyzing error propagation in a kernel by executing an error analysis kernel on the GPU. Together these form the model of hardware fault propagation in GPGPU programs. Given the original errors, researchers can use the model to compute the error information at every point in a GPGPU program and can accelerate the computation by using the GPU.

In the application part, based on the analysis results of the base part, this thesis proposes and implements optimized fault tolerance methods for MPI programs and GPGPU programs respectively. The main work and contributions of this part are as follows:

1. Proposing WBC-ALC, a weak blocking coordinated application level checkpointing method for MPI programs (Chapter 5)
This thesis analyzes the difficulties of application level checkpointing for MPI programs and proposes WBC-ALC, a weak blocking coordinated application level checkpointing method, to overcome them. Concretely, we introduce the basic idea and coordination mechanism of WBC-ALC, design the programming model and fault tolerance framework for implementing it, and give an implementation based on them. Experimental results show that programmers can easily use WBC-ALC to provide fault tolerance for MPI programs, and that WBC-ALC reduces fault tolerance overheads effectively.

2. Proposing LazyFT, a lazy error detection method for GPGPU programs (Chapter 6)
This thesis analyzes the propagation characteristics of errors caused by transient faults in GPU computing units on CPU-GPU heterogeneous platforms. Based on the analysis results, we propose a lazy error detection method, design the fault tolerance method LazyFT on top of it, and give the fault tolerance framework of LazyFT. We also build a time model of GPGPU program execution. Based on the time model, we give two methods for selecting the best fault tolerance granularity, one for each of two typical kinds of sections in scientific computing programs. We validate LazyFT experimentally; the results show that, compared with eager fault tolerance methods, LazyFT noticeably reduces the fault tolerance overheads of GPGPU programs whether or not faults occur.

3. Proposing PartialRC, a partial recomputing method for GPGPU programs (Chapter 7)
Based on an analysis of which computations actually need to be recomputed in GPGPU programs after transient faults occur on the GPU, this thesis introduces the idea of partial recomputing for GPGPU programs for the first time. Furthermore, we propose PartialRC, a fault recovery method built on partial recomputing, and design its programming model and fault tolerance framework. We also give the basic principles, implementation, and optimization of the key techniques in the fault tolerance framework. Experimental results show that, compared with fault recovery based on full recomputing, PartialRC reduces the fault recovery overheads of GPGPU programs after transient faults occur on the GPU.
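
As an illustration of the forward data flow analysis mentioned for serial programs, the following is a minimal sketch, not the thesis's actual equations or algorithms. It assumes a toy control flow graph in which each node holds at most one assignment, and it tracks only data-flow subsequential errors: a variable may be erroneous after a node if it was assigned from a possibly erroneous source, and it is cleared if overwritten from clean sources. Control-flow subsequential errors are not modeled. The names `propagate_errors`, `cfg`, `stmts`, and `original_errors` are hypothetical.

```python
from collections import deque


def propagate_errors(cfg, stmts, original_errors):
    """Forward may-analysis of possibly erroneous variables.

    cfg: dict mapping every node to its list of successor nodes.
    stmts: dict mapping a node to (dst, srcs) for an assignment
           dst = f(srcs); nodes without an assignment are omitted.
    original_errors: dict mapping a node to the set of variables
           corrupted by a hardware fault right after that node.
    Returns a dict mapping each node to the set of variables that
    may hold an erroneous value after the node executes.
    """
    # Build predecessor lists so each node can merge incoming facts.
    preds = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].append(n)

    out = {n: set() for n in cfg}
    work = deque(cfg)
    while work:
        n = work.popleft()
        # Merge: a variable may be erroneous if it is on any incoming path.
        in_set = set()
        for p in preds[n]:
            in_set |= out[p]

        new_out = set(in_set)
        stmt = stmts.get(n)
        if stmt is not None:
            dst, srcs = stmt
            if in_set & set(srcs):
                new_out.add(dst)      # error flows from a source to dst
            else:
                new_out.discard(dst)  # dst overwritten from clean sources
        new_out |= original_errors.get(n, set())

        if new_out != out[n]:
            out[n] = new_out
            work.extend(cfg[n])       # successors must be revisited
    return out


# Toy program:
#   1: a = input()      2: b = a + 1      3: if b > 0
#   4: c = b * 2        5: c = 7          6: d = c + a
cfg = {1: [2], 2: [3], 3: [4, 5], 4: [6], 5: [6], 6: []}
stmts = {1: ("a", []), 2: ("b", ["a"]), 4: ("c", ["b"]),
         5: ("c", []), 6: ("d", ["c", "a"])}
result = propagate_errors(cfg, stmts, {1: {"a"}})
for node in sorted(result):
    print(node, sorted(result[node]))
```

In this toy run, an original error in `a` after node 1 reaches `b`, then `c` along the branch through node 4 but not through node 5, and the merge at node 6 conservatively treats `c` and therefore `d` as possibly erroneous.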
Keywords/Search Tags: Hardware fault, Fault propagation, SIHFT, Data flow analysis, MPI, Checkpointing, GPGPU, Error detection, Partial recomputing