
Research On Failure Analysis, Modeling And Prediction For Supercomputers

Posted on: 2019-04-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R T Liu
GTID: 1368330566970878
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of supercomputers, their scale and complexity keep increasing, and reliability and resilience face ever greater challenges. Fault tolerance relies on several important technologies, such as proactive failure avoidance based on fault prediction, reactive fault tolerance based on checkpointing, and scheduling techniques that improve reliability. Both qualitative and quantitative descriptions of system fault characteristics, together with effective fault prediction, are critical for these technologies. This dissertation analyzes the characteristics of failures on two typical supercomputers: Sunway Blue Light (based on multi-core CPUs) and Sunway TaihuLight (based on heterogeneous many-core CPUs). It presents several new methods of fault analysis for supercomputers and uncovers fault characteristics and rules that were previously unknown. It also builds models of fault distribution and resilience for supercomputers and presents an effective fault prediction method. The main contributions and innovations of the dissertation are as follows:

1. To address the discrete, diverse, instantaneous, uncertain and non-retroactive nature of system faults in supercomputers, the dissertation proposes a scalable fault monitoring, collection and analysis framework. Built on a distributed infrastructure, an extensible state monitoring and acquisition model based on triggered events is proposed, which can efficiently acquire the failure status information of a massively parallel system in real time. Experiments show that the performance of the state monitoring model is independent of the system size, and that it can discover faults in less than 20 seconds on a large-scale parallel system. Based on fault sensing point setting and fault data processing methods, a fault analysis system based on statistical data is established, which can effectively analyze and discover the features and influencing factors of supercomputer faults. The fault analysis finds that the mainframe, consisting of CPU, memory and interconnect, is the major source of failures on the supercomputer.

2. Because memory failures are among the major faults in supercomputers, an analysis method for the correlation between memory faults based on sequential pattern mining is proposed. The method establishes a sequence rule model for correlation analysis between memory faults. Using the large volume of memory failure data from the supercomputer's mainframe, it can effectively analyze the correlation between DRAM SBEs (single-bit errors) and DRAM MBEs (multiple-bit errors), as well as the correlation between a memory failure sequence and a subsequent memory failure on the CPU node. The analysis yields previously unknown key conclusions that affect system fault-tolerance design and memory failure prediction: DRAM single-bit errors do not lead to DRAM multiple-bit errors; a CPU node's memory failure sequence may lead to a subsequent memory failure.

3. To identify the factors that influence failures of the main computing components of supercomputers, a fault feature identification method combining statistical rules and co-analysis is proposed. This method sets up or selects a targeted experimental environment. Based on statistical data, it finds and verifies the failure laws of the main computing components and identifies the key factors in the system that affect their reliability and faults. The conclusions include: DRAM single-bit errors are unrelated to the running job and are related to the reliability of the CPU node or DRAM; memory failures may be related to the reliability characteristics of the memory chip itself; pure compute-intensive applications have minimal impact on CPU faults or failures.

4. For a quantitative description of the failure time of the main computing components in the supercomputer, the failure data are analyzed along the time and space dimensions, and a uniform multi-dimensional failure time model for the supercomputer is built. The model mainly includes a uniform failure time model for memory on the CPU node and a uniform multi-dimensional failure time model for the CPU node, the computing card and the mainframe. The model was used to evaluate reliability. In combination with the failure prediction scenario, a failure prediction model based on the time between failures was established, and its application and solution methods were analyzed. The model includes: the time between failures of the CPU node's memory can be described quantitatively by the Lognormal distribution; the Weibull distribution best matches the time between failures in the multi-dimensional space.

5. To address the high checkpoint overhead in supercomputers caused by the mismatch between checkpoint-based fault tolerance and the reliability of the actual running environment, a data-driven self-adaptive fault-tolerance model is proposed. Based on the distribution of failure times on fine-grained resources, this model establishes a multi-level failure model for complex faults on supercomputers. According to the dynamic fault characteristics of the system, a data-driven adaptive fault-tolerance method is proposed, and an adaptive optimization algorithm is designed. Fault-tolerance experiments on Sunway TaihuLight verify the data-driven self-adaptive fault-tolerance model and the checkpoint optimization method. The analysis shows that the optimal checkpoint interval reduces checkpoint overhead more effectively than the empirical checkpoint interval.

6. To meet the need for accurate fault prediction in proactive fault-tolerance technologies on supercomputers, a fault prediction algorithm based on sequential pattern mining of time-stamped multi-sequences is proposed. The algorithm is based on the serial Winepi algorithm, extended and improved for multi-sequences. It uses sliding windows to mine sequential patterns on time-stamped multi-sequences under a constraint on the time window size, and it predicts faults together with their location and time. The algorithm is used to predict faults on Sunway series supercomputers. The results show that the generated prediction rules have good confidence and that faults on supercomputers can be effectively predicted. The prediction precision is between 60% and 99%.
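The sliding-window idea behind Winepi-style rule mining can be illustrated with a minimal sketch. This is not the dissertation's algorithm, which the abstract does not specify in detail; it only assumes the textbook Winepi notions that a serial episode is counted once per window position that contains it, and that a rule's confidence is the ratio of two such counts. The function names (`occurs_in_order`, `window_count`, `rule_confidence`), the event labels `warn` and `fail`, the node names, and the choice to pool window counts across per-node sequences are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, event type); hypothetical layout


def occurs_in_order(events: Sequence[Event], episode: Sequence[str],
                    start: int, end: int) -> bool:
    """True if the episode's event types occur in order within [start, end)."""
    if not episode:
        return True
    idx = 0
    for t, kind in events:  # events must be sorted by timestamp
        if start <= t < end and kind == episode[idx]:
            idx += 1
            if idx == len(episode):
                return True
    return False


def window_count(events: Sequence[Event], episode: Sequence[str],
                 width: int) -> Tuple[int, int]:
    """Return (windows containing the episode, total windows) for one sequence.

    Windows of the given width slide one time unit at a time, starting so that
    the first window covers only the first event and ending so that the last
    window covers only the last event (the usual Winepi convention).
    """
    if not events:
        return 0, 0
    first, last = events[0][0], events[-1][0]
    starts = range(first - width + 1, last + 1)
    hits = sum(occurs_in_order(events, episode, s, s + width) for s in starts)
    return hits, len(starts)


def rule_confidence(sequences: Dict[str, List[Event]],
                    antecedent: Sequence[str], consequent: Sequence[str],
                    width: int) -> float:
    """Confidence of 'antecedent is followed by consequent within one window'.

    Assumption: counts are pooled over all per-node sequences.
    """
    full = tuple(antecedent) + tuple(consequent)
    hits_full = hits_ante = 0
    for events in sequences.values():
        ordered = sorted(events)
        hits_full += window_count(ordered, full, width)[0]
        hits_ante += window_count(ordered, tuple(antecedent), width)[0]
    return hits_full / hits_ante if hits_ante else 0.0


# Toy data: two hypothetical nodes, window width of 5 seconds.
logs = {
    "node0": [(0, "warn"), (3, "fail")],
    "node1": [(0, "warn")],
}
print(rule_confidence(logs, ("warn",), ("fail",), width=5))  # prints 0.2
```

In this toy run, 10 window positions contain `warn` and 2 of them also contain a later `fail`, giving a confidence of 0.2. A predictor built on such rules would raise an alert for a node when the antecedent is observed and the rule's confidence exceeds a chosen threshold; the threshold and the window width are tuning parameters that the abstract does not state.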
Keywords/Search Tags:supercomputer, monitoring model, sequential pattern, correlation relationship, statistical rule, co-analysis, fault feature, multidimensional failure model, data driven, fault tolerance model, time-stamped multi-sequences, data mining, fault prediction