Research On Key Technologies Of Failure Prediction Based On Machine Learning Method For Exascale System

Posted on: 2012-06-12
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhou
GTID: 2218330362460265
Subject: Computer Science and Technology
Abstract/Summary:
As the demands of science and engineering applications keep growing, building an Exascale supercomputer system has become one of the major research goals in the field of high-performance computing in many developed countries. Because more advanced enabling technologies remain at the fundamental research stage, the basic development method and technical route for implementing such a system today is to integrate processors. However, given the constraints of current manufacturing technology, the reliability of the physical devices themselves is difficult to guarantee or improve. Therefore, as the scale of parallel systems increases rapidly, system failures become more and more frequent and pose a great challenge to system reliability. The mainstream fault-tolerant method, rollback and recovery, has many drawbacks, such as continual checkpointing, a large amount of backup data, and a huge operating overhead for system recovery, so it cannot be used effectively in the coming Exascale systems.
In this thesis, we focus on active fault-tolerant methods and consider combining them with traditional passive ones, so as to address the reliability wall problem that arises when designing and implementing large-scale systems.
First, we construct a self-controlling active fault-tolerant model at the system node level. Then, combining it with passive fault-tolerant methods, we propose a two-level hierarchical fault-tolerant scheme that fuses active and passive fault-tolerant techniques and executes them in the order 'first active, then passive'.
For failure prediction, the crucial step of active fault tolerance, we build an online failure prediction model based on machine learning, and design both its overall execution flow and the functional module framework of each system node.
Gathering and handling system status information in a timely manner is the premise of effective failure prediction. To support dynamic online failure prediction, we implement a real-time status collection method and a periodic information gathering method, and configure them to run automatically. With the IASF method we implement, we preprocess the gathered system logs and successfully remove a massive amount of useless information from them.
Based on the log information remaining after temporal and spatial filtering, we design a set of failure characteristic parameters derived from system logs and define how they are calculated. Using a sliding time window, each system node dynamically generates these parameters and forms a failure characteristic record corresponding to its current status. This status information is then used for system failure prediction. To simplify the failure characteristic record, we implement two dimension-reduction methods, Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), and select several key characteristics from the many parameters.
Using training samples that consist of simplified failure characteristic records and system status feedback, we adopt two mainstream decision tree algorithms, ID3 and C4.5, to perform the machine learning process. From the constructed decision tree, we implement a rule extraction method and obtain general and simple failure prediction rules.
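The two decision tree algorithms named above differ mainly in how they choose the attribute to split on: ID3 uses information gain, while C4.5 uses gain ratio (information gain normalized by the split information). The following is a minimal sketch of both criteria on a toy set of discretized records. The attribute names (error_rate, burst), their values, and the labels are hypothetical and are not taken from the thesis; the sketch shows only the split criteria, not the thesis's tree construction or its rule extraction.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def partition(records, labels, attribute):
    """Group the labels by the value each record takes for `attribute`."""
    groups = {}
    for record, label in zip(records, labels):
        groups.setdefault(record[attribute], []).append(label)
    return groups


def information_gain(records, labels, attribute):
    """ID3 criterion: entropy reduction obtained by splitting on `attribute`."""
    total = len(labels)
    groups = partition(records, labels, attribute)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(records, labels, attribute):
    """C4.5 criterion: information gain divided by the split information."""
    total = len(labels)
    groups = partition(records, labels, attribute)
    split_info = -sum(len(g) / total * math.log2(len(g) / total)
                      for g in groups.values())
    if split_info == 0:  # attribute takes a single value; no useful split
        return 0.0
    return information_gain(records, labels, attribute) / split_info


# Hypothetical, already discretized failure characteristic records.
records = [
    {"error_rate": "high", "burst": "yes"},
    {"error_rate": "high", "burst": "no"},
    {"error_rate": "low", "burst": "yes"},
    {"error_rate": "low", "burst": "no"},
]
labels = ["failure", "failure", "normal", "normal"]

# The attribute with the highest score becomes the root of the tree.
best_attribute = max(records[0],
                     key=lambda a: gain_ratio(records, labels, a))
```

In this toy data the error_rate attribute separates the two classes perfectly, so both criteria select it as the first split, while burst carries no information about the label.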
Taking these rules as its classifier, each system node distinguishes between its normal and abnormal status, so as to predict node failures that will occur in the near future.
Finally, we examine and evaluate each processing stage of the proposed online failure prediction method. The experimental results show that the parallel system achieves its best failure prediction performance when adopting the configuration and execution manner proposed in this thesis.
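The per-node flow described in the abstract (a sliding time window produces a status record, and extracted rules classify it as normal or abnormal) can be sketched as follows. This is an illustrative sketch only: the class name, the three parameters (event_count, error_count, error_ratio), the severity strings, and the single threshold rule are all assumptions made for the example and are not the parameters or rules defined in the thesis. Events are assumed to arrive in non-decreasing timestamp order.

```python
from collections import deque


class WindowedNodeMonitor:
    """Keeps recent log events of one node and classifies its status by rules."""

    def __init__(self, window_seconds, rules):
        self.window_seconds = window_seconds
        self.rules = rules        # list of (predicate, label); first match wins
        self.events = deque()     # (timestamp, severity), oldest first

    def add_event(self, timestamp, severity):
        """Append a new log event and drop those that left the window."""
        self.events.append((timestamp, severity))
        self._evict(timestamp)

    def _evict(self, now):
        """Remove events older than the window ending at `now`."""
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()

    def record(self, now):
        """Build the (hypothetical) characteristic record for the current window."""
        self._evict(now)
        total = len(self.events)
        errors = sum(1 for _, severity in self.events if severity == "ERROR")
        return {
            "event_count": total,
            "error_count": errors,
            "error_ratio": errors / total if total else 0.0,
        }

    def predict(self, now):
        """Apply the rules in order; default to 'normal' if none fires."""
        current = self.record(now)
        for predicate, label in self.rules:
            if predicate(current):
                return label
        return "normal"


# One hypothetical if-then rule of the kind a decision tree path could yield.
RULES = [
    (lambda r: r["error_count"] >= 3 and r["error_ratio"] > 0.5, "abnormal"),
]

monitor = WindowedNodeMonitor(window_seconds=60, rules=RULES)
for timestamp, severity in [(0, "INFO"), (10, "ERROR"), (20, "ERROR"), (30, "ERROR")]:
    monitor.add_event(timestamp, severity)

status_during_burst = monitor.predict(now=30)    # all four events are in the window
status_after_quiet = monitor.predict(now=100)    # every event has expired by then
```

Keeping the rules as an ordered list of predicates mirrors the way each root-to-leaf path of a decision tree can be read as one if-then rule, which keeps the online classifier on each node cheap to evaluate.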
Keywords/Search Tags: Active Fault Tolerance, Failure Prediction, Machine Learning, Log Information Process, Failure Character Extracting, Decision Tree