Optimization Techniques Of Proactive Fault Tolerance For Large-scale High Performance Computing Systems

Posted on:2018-03-31

Degree:Doctor

Type:Dissertation

Country:China

Candidate:L Zhu

Full Text:PDF

GTID:1368330563995797

Subject:Computer Science and Technology

Abstract/Summary:


In recent years, the demand for computing power has grown rapidly in every field. To meet this demand, the scale of high performance computing systems (HPCs) has increased dramatically. As system scale grows, the reliability of HPCs decreases, so fault tolerance is indispensable for HPCs. However, a larger system scale increases not only the failure rate but also the overhead of fault tolerance itself. Reducing that overhead has therefore become one of the major challenges for HPCs. Proactive fault tolerance builds on fault prediction: it avoids failures by performing proactive actions before a predicted failure occurs. Its main advantage is that it reduces how often fault-preventive operations must be executed. However, for lack of efficient proactive actions (PAs) and optimization strategies, the overhead of existing proactive fault tolerant approaches is still too high for exascale systems. This dissertation studies overhead optimization for proactive fault tolerant techniques. Its main contributions are as follows.

1. This dissertation introduces a proactive fault tolerant approach based on the predicted type of the failure (PTF), called PTFPF. To reduce the overhead of PTFPF, it proposes the overhead balanced proactive action selecting strategy (OBPASS) and the gain aware two-level proactive checkpointing strategy (GTPCS). HPCs suffer from various types of failures, yet no single proactive action tolerates all of them with low overhead. The overhead of proactive fault tolerant approaches that rely on a single proactive action is therefore not low enough, and because the failure rate of extreme-scale systems is very high, it becomes unacceptable on such systems. For this reason, the dissertation studies proactive fault tolerance with multiple proactive actions and proposes PTFPF. To address the impact of a falsely predicted PTF on PTFPF, a performance model for PTFPF is established. Based on this model, the dissertation introduces an optimal PA selecting strategy named OBPASS. OBPASS estimates the expected overhead of the different proactive actions so that the system can always select the one with the lower expected overhead. To reduce the overhead of PTFPF further, the multiple level proactive checkpointing (MLPC) method is studied. Because the predictor may forecast the failure level falsely, GTPCS is proposed. Using system parameters, GTPCS approximates the gains and losses of storing checkpoints at different levels, and the system selects the checkpointing level from those results. The evaluations show that OBPASS reduces the overhead of PTFPF by 8% when the predictor cannot forecast the failure type accurately, and that, for extreme-scale systems, PTFPF reduces the overhead by 20% compared with existing proactive fault tolerant approaches. The simulations of MLPC show that GTPCS reduces the impact of falsely predicted failure levels on PTFPF with MLPC, and that, for extreme-scale systems, the GTPCS-based MLPC method further reduces the overhead of PTFPF by 12%.

2. This dissertation studies the proactive uncoordinated checkpoint/restart (CR) technique with the domino effect (PUCRD). To optimize the overhead of PUCRD, it proposes the minimum set logging (MSL) method and an optimized storage protocol for the proactive message logging technique (SPPML). Because uncoordinated CR is subject to the domino effect, it has to be combined with message logging. However, to the best of our knowledge, existing message logging approaches are reactive methods and are therefore costly. To address this issue, the dissertation proposes MSL for proactive fault tolerance. Using the predicted fault location, MSL reduces the overhead of message logging by ignoring the messages that are unrelated to the fault. Based on MSL, the dissertation introduces the proactive message logging (PML) technique. Then, to reduce the overhead of PUCRD, it proposes SPPML. Based on overhead estimation, SPPML reduces the overhead of PUCRD by switching the storage protocol according to the system state. The experimental results show that MSL reduces the overhead of PML by 83%; that, compared with the HM_PLL technique, PML based on MSL reduces the time overhead of message logging by over 95%; and that, compared with the traditional storage strategy, SPPML reduces the overhead of PUCRD by 6%. Further evaluations show that, when the system scale exceeds 1M, the overhead of PUCRD is 25% lower than that of existing proactive fault tolerant approaches.

3. This dissertation proposes a unified model for time-redundancy-based proactive actions (UMTPA), the unified period optimization strategy for time-redundancy-based proactive actions (UPOTP), and the minimum grouping strategy of the unified fault tolerance with PML (MGSUP). Because no unified analytical model characterizes the overhead of proactive time-redundancy-based fault tolerance approaches, the dissertation introduces the unified time-redundancy proactive fault tolerance (UTPF), which is based on the hierarchical checkpointing technique, and then proposes UMTPA. To optimize the overhead of UTPF, it proposes UPOTP, which optimizes the length of the computing fragment of UTPF by differentiation. Feature analysis shows that the overhead of PML decreases as the number of system groups increases. For UTPF, however, a larger number of system groups also increases the probability that the predictor forecasts a failure location falsely, so the maximum grouping policy is problematic. To balance the positive and negative impacts of system grouping on UTPF, the dissertation proposes MGSUP. With MGSUP, hierarchical checkpointing reduces the overhead of UTPF significantly using only a small number of system groups. The simulations show that UMTPA accurately characterizes the overhead of the widely used time-redundancy proactive actions, and that UPOTP precisely estimates the best length of the computing fragment for UTPF. In addition, MGSUP effectively approximates the optimal number of system groups (G^*). If the optimization gain of the maximum grouping policy is normalized to 1, UTPF achieves over 98% of that gain when the number of system groups is G^*. The overall evaluations show that, for extreme-scale systems, UTPF reduces the fault tolerance overhead by 22% compared with PTFPF and by 17% compared with PUCRD.

4. This dissertation proposes sparse representation classification with time slice and correlation table (SRTC) and introduces a system log pre-processing method based on SRTC, named sparse representation classification based pre-processing (SRCP). The recall and precision with which log pre-processing filters invalid records affect both the validity of the log-based failure traces used in the simulations of this dissertation and the probability of false negative and false positive events of the predictor. Poor filtering ultimately increases the overhead of the proactive fault tolerant approach. The filtering recall of existing log pre-processing approaches is high, but their filtering precision is relatively poor. To address this issue, the dissertation studies how to optimize the filtering precision of log pre-processing and proposes SRTC. SRTC effectively improves the filtering precision of log pre-processing, while its impact on the filtering recall is acceptable. The experimental results show that, compared with existing pre-processing approaches, SRCP improves the filtering precision by 8% and the F1-measure by 3.5%. SRCP therefore improves the effectiveness of the log-based failure traces used in the simulations. In addition, SRCP reduces the probability of false negative events by 7% for the predictor without any modification to the prediction algorithm.
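
The action-selection idea behind OBPASS (estimate the expected overhead of each proactive action, then pick the cheapest) can be illustrated with a minimal Python sketch. This is not the performance model from the dissertation, which the abstract does not give. The cost formula, the names `ProactiveAction`, `expected_overhead`, and `select_action`, and all numeric values are hypothetical assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProactiveAction:
    """Hypothetical description of one proactive action."""
    name: str
    cost: float          # seconds needed to execute the action
    covers: frozenset    # failure types this action can tolerate


def expected_overhead(action, type_probs, recovery_cost):
    """Assumed model: action cost plus the recovery cost weighted by the
    probability that the real failure type is NOT covered by the action."""
    p_covered = sum(p for t, p in type_probs.items() if t in action.covers)
    return action.cost + (1.0 - p_covered) * recovery_cost


def select_action(actions, type_probs, recovery_cost):
    """Pick the action with the lowest expected overhead."""
    return min(
        actions,
        key=lambda a: expected_overhead(a, type_probs, recovery_cost),
    )


# Illustrative values only: a cheap action that covers one failure type
# and a costlier action that covers several.
MIGRATION = ProactiveAction("task_migration", 10.0, frozenset({"node"}))
CHECKPOINT = ProactiveAction(
    "proactive_checkpoint", 60.0, frozenset({"node", "software", "network"})
)
ACTIONS = [MIGRATION, CHECKPOINT]
RECOVERY_COST = 600.0  # seconds lost if the failure is not tolerated

# Predictor fairly sure about the failure type.
confident = {"node": 0.95, "software": 0.05}
# Predictor unsure about the failure type.
uncertain = {"node": 0.6, "software": 0.4}

best_when_confident = select_action(ACTIONS, confident, RECOVERY_COST)
best_when_uncertain = select_action(ACTIONS, uncertain, RECOVERY_COST)
```

Under these assumed numbers, the cheap but narrow action wins when the predicted type is reliable (about 40 s against 60 s of expected overhead), and the broader action wins once the type prediction becomes unreliable (60 s against about 250 s). That is the trade-off the abstract attributes to selecting among multiple proactive actions.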

Keywords/Search Tags:

High Performance Computing, Log Pre-processing, Proactive Fault Tolerance, Task Migration, Checkpoint/restart, Multi-level Checkpointing, Hybrid Fault Tolerance, Overhead Optimization


Related items

1	Research On Adaption Method Of Cloud Fault Tolerance Services Based On User Requirement And Resource Constriction
2	The Design And Research Of Process Level Fault-tolerance Based On Checkpoint
3	Study And Implementation Of Fault Tolerance For Heterogeneous Parallel Computer
4	The Research And Implementation Of Checkpoint Technology Based On WinNT
5	Achieving Fault-Tolerance And High-Performance In Grid Applications
6	Research On Low Overhead Non-blocking Checkpointing Scheme For Mobile Computing System
7	Research And Implementation Of Mapreduce Fault Tolerance Method Based On Intermediate Result Checkpoint
8	Research On Incremental Checkpointing And Rollback Recovery
9	Study And Implementation Of Application-Level Checkpointing
10	The Systematic Study Of Fault-tolerant Die