Failure Prediction Method Of Cluster System Based On Spatio-Temporal Correlation Analysis

Posted on: 2021-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: J Dong
Full Text: PDF
GTID: 2518306308473274
Subject: Information and Communication Engineering
Abstract/Summary:
In recent years, large-scale cluster systems have been widely used in high-performance computing. However, with the rapid expansion of cluster size and the growing complexity of services and component structures, failures have become the norm. As a proactive reliability management and failure prevention mechanism, failure prediction analyzes the historical state of a system to predict whether it will fail in the future, which is of great significance for improving the availability and applicability of cluster systems. A cluster is formed by loosely coupling multiple computing nodes, and failures may occur at both the node level and the system level: nodes may fail due to hardware or software faults, and the propagation and evolution of node failures may in turn cause a system failure within the cluster. This research finds a clear spatiotemporal correlation between node indicators and node failures, and among node failure instances in the cluster system. The specific manifestations are as follows: 1) Temporal correlation: abnormal fluctuations of node performance indicators over time are symptoms of node failure, and frequent event sequences in the system logs herald the evolution of system failures. 2) Spatial correlation: the performance indicators of nodes with potential failures differ from those of normal nodes, especially nodes carrying similar services, and a node failure can cause other nodes to fail because of the system's cooperative communication. In this thesis, by deeply mining the spatiotemporal correlation of failures, we study the accurate prediction of node failures and cluster system failures.

Existing node failure prediction methods mostly analyze node performance monitoring indicators with a classification model, and suffer from problems such as overly coarse extraction of failure symptoms and an imbalanced distribution of failure and normal samples. This thesis proposes a node failure prediction method based on spatio-temporal feature extraction (FP-STE). To address coarse feature extraction, an improved recurrent neural network, HW-GRU (an improved GRU based on the Highway network), and a convolutional neural network (CNN) are used to extract the temporal and spatial characteristics of node parameters, which increases the discrimination between different types of failure symptoms and improves prediction accuracy. In addition, considering the impact of sample imbalance on multi-class prediction, this thesis improves the ensemble learning model XGBoost with SMOTE oversampling and a cost-sensitive learning strategy. Experimental results show that the prediction accuracy of the FP-STE algorithm is better than that of other methods, and that it can effectively distinguish multiple failure types.

For system failures, most current prediction methods analyze system logs in an event-driven manner, but they generally suffer from a low recall rate, poor operating efficiency, and high update costs. To overcome these shortcomings, this thesis proposes a dynamic failure prediction method based on causal association analysis (DFP-CAA). The method uses a new log preprocessing algorithm to perform adaptive identification of typical failure events and log filtering based on the semantic similarity and temporal correlation of events. To address the inefficiency of rule extraction and updating and the cold-start problem, an improved weighted incremental association rule mining algorithm (IWApriori) is used to mine frequent event sequences, generate failure-derived rules, and automatically trigger the rule updating procedure throughout the system life cycle. Furthermore, considering the causal relationships between events and the inefficiency of rule reasoning, this thesis designs a weighted causality dependency graph to represent event rules, and predicts possible future failures through forward uncertainty reasoning over the causality graph. Finally, this thesis verifies the effectiveness and superiority of the improved method on three real system logs: LANL, Blue Gene/L, and Blue Gene/Q.
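
The idea of forward reasoning over a weighted causality dependency graph can be illustrated with a minimal sketch. The abstract does not specify how weights are defined or combined, so everything below is an assumption made for illustration: edge weights are treated as rule confidences in [0, 1], belief is propagated by multiplying along a path, the strongest path wins, and an event is predicted when its belief reaches a threshold. The event names, the `RULES` table, and the `predict_failures` function are hypothetical and are not taken from the thesis.

```python
from collections import deque

# Hypothetical rule graph: cause -> {effect: weight}.
# Weights are assumed rule confidences, not values from the thesis.
RULES = {
    "disk_warning": {"io_error": 0.8},
    "io_error": {"fs_readonly": 0.9, "node_down": 0.5},
    "fs_readonly": {"node_down": 0.7},
}


def predict_failures(observed, rules, threshold=0.4):
    """Propagate belief forward from observed events.

    Returns a dict of events that were not observed but whose belief
    (product of weights along the strongest path) is >= threshold.
    """
    belief = {event: 1.0 for event in observed}  # observed events are certain
    queue = deque(observed)
    while queue:
        cause = queue.popleft()
        for effect, weight in rules.get(cause, {}).items():
            candidate = belief[cause] * weight
            # Keep only the strongest derivation of each effect; re-queue
            # the effect so its own consequences are updated as well.
            if candidate > belief.get(effect, 0.0):
                belief[effect] = candidate
                queue.append(effect)
    return {
        event: value
        for event, value in belief.items()
        if event not in observed and value >= threshold
    }


if __name__ == "__main__":
    for event, value in predict_failures(["disk_warning"], RULES).items():
        print(f"{event}: {value:.3f}")
```

Taking the maximum over paths is only one possible combination rule; other uncertainty-reasoning schemes would combine evidence from several causes differently, and the thesis may well use one of those.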
Keywords/Search Tags: failure prediction, cluster system, feature extraction, spatiotemporal correlation, causality dependency graph