
Research On Failure Prediction Of Supercomputers Based On Online Machine Learning

Posted on: 2018-12-03
Degree: Master
Type: Thesis
Country: China
Candidate: Q Sun
GTID: 2428330569998705
Subject: Computer technology
Abstract/Summary:
The ever-increasing demand of applications drives the development of supercomputers. As systems grow, the large number of components, low-voltage operation, and complex hardware and software structure make the mean time between failures (MTBF) of supercomputers shorter and shorter, and reliability problems become increasingly prominent. The original checkpoint-based passive fault-tolerance method is too costly and has seriously affected system usability. In recent years, active fault-tolerance strategies based on failure prediction have become a main method for improving supercomputer reliability. Existing failure-prediction techniques are mostly offline learning methods with poor dynamic adaptability, and they cannot meet the application requirements of supercomputers. An online failure-prediction method is therefore urgently needed, one that learns online and predicts failures in time, so that low-overhead active fault tolerance can be carried out before a failure appears and system reliability can be improved.

This thesis studies data preprocessing and failure-prediction techniques for the nodes of the Tianhe-1 supercomputer.

In data preprocessing, feature selection is used to simplify the dataset and eliminate irrelevant and redundant data. Building on the traditional mutual-information criterion of max-relevance and min-redundancy (mRMR), the thesis proposes a feature selection algorithm, called mCRC for short, that combines multi-criterion ranking with a support vector machine. mRMR is a feature selection algorithm that performs well in both running efficiency and classification accuracy, but it measures relevance and redundancy using mutual information alone, which makes it one-sided. mCRC improves classification accuracy by measuring relevance and redundancy with both mutual information and class separability. mCRC also searches for the best feature subset with an improved forward floating search method, which overcomes mRMR's difficulty in determining the final feature subset. Experiments show that the classification accuracy of mCRC is about 1.6% higher than that of mRMR, and the final subset of mCRC is 22% smaller than that of mRMR. mCRC therefore not only improves the performance of the selected subset but also reduces the data acquisition overhead and storage burden of Tianhe-1.

In failure prediction, the thesis presents a method based on online machine learning for supercomputers. The technique uses ensemble data stream mining to learn the node state data stream online and determine whether a node will fail. Among ensemble data stream classification methods, MAE, an ensemble algorithm with recalling and forgetting mechanisms for data stream mining proposed by Zhao Qiangli, has significant advantages in prediction accuracy and stability. The memory and forgetting features of MAE can alleviate the impact of class imbalance, but for seriously imbalanced datasets, learning from data blocks and achieving accurate predictions remain difficult. Because the dataset collected from Tianhe-1 is seriously imbalanced, an algorithm called ReMAE, based on MAE, is proposed to solve this problem. The experimental results show that ReMAE has a 37% higher recall rate than MAE, although its overall classification accuracy is lower than that of MAE, which indicates that ReMAE identifies fault data more accurately. For supercomputer failure prediction, the focus is on predicting whether a failure will occur in the next phase, so ReMAE is more suitable for failure prediction on imbalanced datasets than other ensemble data stream mining algorithms.
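
For readers unfamiliar with the mRMR criterion that the abstract's feature selection work builds on, the following is a minimal sketch of greedy mRMR ranking on discrete features. It is not the thesis's mCRC algorithm and does not include class separability, the support vector machine, or the floating search. The function names (`mutual_information`, `mrmr_rank`) and the toy feature names are hypothetical, and mutual information is estimated from simple value counts.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences,
    estimated from empirical counts."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_rank(features, labels, k):
    """Greedy mRMR: at each step pick the feature whose relevance to the
    labels, minus its mean redundancy with already selected features,
    is largest. `features` maps a feature name to its column of values."""
    relevance = {
        name: mutual_information(col, labels) for name, col in features.items()
    }
    selected = []
    remaining = set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: the label is determined by two independent bits.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "cpu_temp": [0, 0, 0, 0, 1, 1, 1, 1],       # relevant
    "cpu_temp_copy": [0, 0, 0, 0, 1, 1, 1, 1],  # relevant but redundant
    "fan_speed": [0, 0, 1, 1, 0, 0, 1, 1],      # relevant, not redundant
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],          # independent of the label
}

print(mrmr_rank(features, labels, 2))  # ['cpu_temp', 'fan_speed']
```

In this toy case the duplicated column is skipped in favor of `fan_speed` because its redundancy with the already selected `cpu_temp` cancels its relevance. Note that plain mRMR still requires the caller to choose `k`, which is the subset-size problem the abstract says mCRC addresses with its search method.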
Keywords/Search Tags: Supercomputer, Active Fault Tolerance, Failure Prediction, Feature Selection, Ensemble Data Stream Mining