
Study On Adaptive Failure Prediction Algorithm For Supercomputer

Posted on: 2015-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: L Su
Full Text: PDF
GTID: 2268330422972198
Subject: Computer software and theory
Abstract/Summary:
With the development of information technology, large-scale distributed systems such as cloud computing platforms are being widely deployed and put to practical use. As the software and hardware in application systems grow more complex, ensuring that a system works correctly over long periods and provides users with high-quality service becomes a problem that must be considered thoroughly during system design and development. If a system can diagnose itself through failure prediction, its fault tolerance and resource management capability can be improved significantly, which in turn helps guarantee high availability and high reliability. Supercomputers are complex computing systems, so research on failure prediction is important for improving their computing performance and system-level fault tolerance. Efficient failure prediction strategies can also be applied to enhance the fault tolerance of other large-scale systems.

This thesis is based on the Reliability, Availability and Serviceability (RAS) event logs generated by a supercomputer. Using these logs, we propose a semantic time filter algorithm (STF for short) to reveal system failure behavior. The STF algorithm takes both the semantic correlation and the time correlation between event logs into account and filters redundant logs based on these correlations. The filtering experiments show that the filtered log sequences reveal how non-fatal events evolve into fatal events. This observation supports the construction of our adaptive failure prediction model.

After analyzing the filtered event logs, we built a failure prediction model based on a classification algorithm. We partition time into fixed windows and attempt to predict whether a fatal event will occur in each window based on the event pattern in the preceding windows.
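The filtering and windowing steps described above can be illustrated with a minimal Python sketch. It is not the thesis's STF algorithm or feature set, which the abstract does not specify. It assumes that "semantic correlation" can be approximated by an identical (location, key) pair, that "time correlation" means a fixed gap in seconds, and that each window is described by simple event counts. The names `filter_redundant`, `build_samples`, `gap`, `window` and `history`, as well as the tuple layout and severity strings, are hypothetical.

```python
def filter_redundant(events, gap):
    """Drop an event when the same (location, key) pair was seen within `gap` seconds.

    events: list of (timestamp, location, key, severity) tuples sorted by timestamp.
    Assumption: exact key equality stands in for semantic correlation.
    """
    last_seen = {}
    kept = []
    for ts, location, key, severity in events:
        ident = (location, key)
        previous = last_seen.get(ident)
        # Record every occurrence, even dropped ones, so a long burst
        # of repeats collapses to its first entry.
        last_seen[ident] = ts
        if previous is not None and ts - previous <= gap:
            continue
        kept.append((ts, location, key, severity))
    return kept


def build_samples(events, window, history):
    """Turn a filtered log into (features, label) pairs, one per time window.

    features: non-fatal and fatal event counts for each of the `history`
              windows that precede the current one (an assumed encoding).
    label:    True if the current window contains at least one fatal event.
    """
    if not events:
        return []
    start = events[0][0]
    n_windows = int((events[-1][0] - start) // window) + 1
    nonfatal = [0] * n_windows
    fatal = [0] * n_windows
    for ts, _location, _key, severity in events:
        idx = int((ts - start) // window)
        if severity == "FATAL":
            fatal[idx] += 1
        else:
            nonfatal[idx] += 1
    samples = []
    for i in range(history, n_windows):
        features = []
        for j in range(i - history, i):
            features.extend([nonfatal[j], fatal[j]])
        samples.append((features, fatal[i] > 0))
    return samples
```

The resulting (features, label) pairs are the kind of input a classifier could be trained on; the actual event patterns used in the thesis may differ.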
Based on a dynamic training set, we use AdaBoost to boost an SVM classifier adaptively by adjusting the key parameter of the SVM.

We selected event logs generated by IBM BlueGene/L over 215 days as the data set. The comparative experiments show that our adaptive failure prediction model, AdaBoostSVM, outperforms the other failure prediction models (TBF, RIPPER, kNN and SVM based); in particular, it increases the failure recall rate by 10%-20%.
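As a rough illustration of the boosting step, the sketch below shows one round of standard discrete AdaBoost in Python. It is independent of the component classifier, so no SVM is trained here. Returning `None` when a component is no better than chance, as a signal to retrain it with a different parameter setting, is an assumption about how the adaptive adjustment might be triggered; the abstract does not say which SVM parameter is adjusted or when. The function name `boost_round` is hypothetical.

```python
import math


def boost_round(weights, y_true, y_pred, eps=1e-10):
    """One discrete AdaBoost round for labels in {-1, +1}.

    weights: current sample weights
    y_true:  true labels
    y_pred:  labels predicted by the component classifier (e.g. an SVM)
    Returns (alpha, new_weights), or None when the weighted error is >= 0.5.
    Assumption: a None result would prompt the caller to retrain the
    component with an adjusted parameter.
    """
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    if err >= 0.5:
        return None
    err = max(err, eps)  # keep alpha finite for a perfect component
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples (t * p == -1) gain weight; correct ones lose it.
    raw = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(raw)
    return alpha, [w / norm for w in raw]
```

After a successful round, the misclassified samples together carry half of the total weight, which forces the next component to focus on them.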
Keywords/Search Tags: systematic fault tolerance, supercomputer, log analysis, adaptive failure prediction