
Research On Failure Prediction And Fault-tolerance Technology For Supercomputer

Posted on: 2018-04-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Hu
Full Text: PDF
GTID: 1368330569498501
Subject: Computer Science and Technology
Abstract/Summary:
The demands of ever-growing scientific applications drive the development of supercomputers. As supercomputers grow in size, the number of components increases, the hardware and software structure becomes more complex, and the operation mode changes rapidly. As a result, the mean time to failure becomes much shorter and reliability issues become increasingly prominent. Existing fault-tolerance technologies, however, are inefficient, so fault tolerance has become one of the most challenging issues for supercomputers. In this thesis, aiming to improve the operating efficiency of large-scale parallel applications under frequent failures, we study both proactive and reactive fault-tolerance technologies. The main contributions are as follows:

1. A distributed data collection method for failure prediction

Existing data collection methods for failure prediction on supercomputers are inefficient: they lack data features and incur large overhead, both of which can reduce prediction accuracy. To solve this problem, we propose DDC, a distributed data collection method oriented to future exascale supercomputers. When DDC is running, lightweight processes are distributed across the nodes to collect data. In this way, data collection has low overhead and high sensitivity and can meet the real-time requirement. For data aggregation, we first propose an adaptive multi-layer data aggregation method, which both acquires a node's key state data before failure and reduces I/O resource consumption by using the high-speed interconnection network. To further cut the overhead of data aggregation, we then propose an improved method: a loop-based data storing and transmission method that works like a one-way circular linked list. This method greatly reduces network and I/O overhead, since it transmits and stores only the state data of the faulty node and a small amount of data from normal nodes. Experimental results for DDC on TH-1A show that DDC has low overhead and good scalability.

2. An adaptive ensemble failure prediction method for supercomputers

Existing failure prediction methods have low accuracy and adapt poorly to dynamic conditions. To solve this problem, we propose FSoE, an adaptive ensemble failure prediction method for supercomputers based on feature selection and online ensemble prediction. First, we propose FSFW, a feature selection method that combines a filter and a wrapper. The filter in FSFW ranks the data features using mutual information combined with a distance measurement. The wrapper in FSFW then quickly selects the target feature subset from the ranked features using SVM. After feature selection, we propose GAE (Group based Adaptive Ensemble), an online data mining prediction method that uses SVM as the base classifier. GAE predicts the node state using the optimal classifier subset, chosen by a similarity grouping method. We also use a sliding window to predict the future status of nodes. Experimental results show that FSFW feature selection effectively improves classifier accuracy, and that the prediction accuracy of GAE is higher than that of existing classical prediction methods. FSoE, which draws on both hardware environment state data and system operation state data, therefore addresses the node failure prediction problem effectively.

3. A fault-tolerance framework combining proactive and reactive fault-tolerance

The failure frequency of current supercomputers is gradually increasing, while existing reactive fault-tolerance methods have high overhead, which seriously affects the performance and scalability of parallel applications. To solve this problem, we propose FTRP, a new fault-tolerance framework that combines proactive and reactive fault-tolerance. Based on the WM cost model and the failure prediction results, FTRP selects the fault-tolerance mechanism adaptively. Through observation, analysis, and experiments on a supercomputer, we identify, for the first time, failure locality in supercomputers, and based on this characteristic we propose a new fault-tolerance method, PRP2. PRP2 provides both a process replication mechanism and a process prefetching mechanism. It protects processes on nodes that are correctly predicted to have an impending failure, and it also protects processes on nodes whose failures the predictor misses. It can therefore improve the efficiency of proactive fault-tolerance. FTRP exploits the advantages of proactive and reactive fault-tolerance mechanisms while avoiding their flaws, and thus can improve application performance on large-scale parallel systems. Simulations with actual failure traces show that our framework is more efficient than existing fault-tolerance mechanisms.

4. A scalability analysis model for checkpointing fault-tolerance technology

Checkpointing is currently the most widely used fault-tolerance technology in supercomputers, but frequent checkpoint save operations incur large I/O overhead. For future exascale computing in particular, this checkpoint overhead would seriously constrain the performance and scalability of large-scale parallel applications. In this thesis, we quantify the impact of the data saving overhead of checkpointing on application scalability. Based on an analysis of the I/O overhead caused by checkpointing, we propose a storage-bounded speedup and storage wall model, which quantitatively models the scalability of parallel applications from the perspective of storage performance. We then analyze the storage wall characteristics of parallel applications under different storage architectures. Finally, we conduct experiments on TH-1A and Jaguar to quantitatively analyze and validate the effect of checkpointing on the scalability of parallel applications using the storage-bounded speedup and storage wall model. The experimental results provide useful guidance for research on fault-tolerance technology, including both proactive and reactive fault-tolerance.
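
As an illustration of the loop-based data storing and transmission idea from the first contribution, the following is a minimal sketch and not the DDC implementation. It assumes each node keeps its recent state samples in a fixed-size circular buffer, sends the whole buffer when the node is faulty, and sends only the latest sample otherwise. The names `StateRing`, `record`, `snapshot`, and `report` are hypothetical, and the abstract does not say exactly which data a normal node transmits, so "latest sample only" is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StateRing:
    """Fixed-capacity circular buffer of a node's most recent state samples.

    Hypothetical structure for illustration; not taken from the thesis.
    """
    capacity: int
    _slots: List[Optional[dict]] = field(default_factory=list)
    _next: int = 0    # index of the slot that will be overwritten next
    _count: int = 0   # number of valid samples currently stored

    def __post_init__(self) -> None:
        if self.capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * self.capacity

    def record(self, sample: dict) -> None:
        """Store a sample, overwriting the oldest one once the ring is full."""
        self._slots[self._next] = sample
        self._next = (self._next + 1) % self.capacity
        self._count = min(self._count + 1, self.capacity)

    def snapshot(self) -> List[dict]:
        """Return the stored samples ordered from oldest to newest."""
        start = (self._next - self._count) % self.capacity
        return [self._slots[(start + i) % self.capacity]
                for i in range(self._count)]


def report(ring: StateRing, node_failed: bool) -> List[dict]:
    """Pick what a node transmits.

    Assumption: a faulty node sends its full recent history, while a
    normal node sends only its latest sample to keep traffic small.
    """
    history = ring.snapshot()
    if node_failed:
        return history
    return history[-1:]
```

With a capacity of 3 and five recorded samples, the ring holds only the three most recent ones, so memory use per node stays constant however long collection runs, and the amount transmitted depends on whether the node is faulty.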
Keywords/Search Tags:failure prediction, proactive fault-tolerance, fault-tolerance model, reliability, supercomputer