Approaches For Failure Prediction And Resource Re-allocation In Cluster Systems

Posted on:2018-11-25

Degree:Doctor

Type:Dissertation

Country:China

Candidate:W W Zheng

Full Text:PDF

GTID:1318330518994731

Subject:Computer Science and Technology

Abstract/Summary:

Cluster systems consist of multiple computers interconnected by high-speed networks. These nodes cooperate as a single system to provide resources with high reliability and high availability. As the number of nodes in a cluster grows, the interactions between system components become more complicated. Moreover, online upgrades and online repairs become increasingly frequent because of the dynamics of the runtime environment. In large-scale cluster systems, failures are becoming the norm rather than the exception. Here, "failure" refers to any anomaly caused by hardware or software defects, incorrect designs, unstable environments, or operational mistakes that makes services or cluster nodes unavailable. To ensure reliability and availability in the face of system dynamics, complexity, and fault sensitivity, cluster systems generally employ fault-tolerance management techniques, which allow the system to maintain an acceptable level of service in the presence of failures.

Traditional fault-tolerance mechanisms apply fault detection and fault repair techniques passively when handling errors and component faults, in order to prevent these errors from escalating into system faults. Proactive fault-tolerance management, in contrast, enables autonomic failure prediction and resource re-allocation, and is therefore of great significance for further improving cluster performance and ensuring cluster availability. In proactive fault-tolerance management, failure prediction is the core technology: it predicts an upcoming failure before it occurs. Based on the predictions, effective measures can be taken to prevent the failure or to reduce the time to repair, keeping the system running efficiently and stably. Resource re-allocation, in turn, responds to emergent failures and mitigates their effects in a timely manner. By combining failure prediction results with system performance states, it forms efficient resource allocation schemes without affecting running services, thus guaranteeing the failure resilience of the system.

Based on an analysis of related research on failure prediction and resource (re-)allocation techniques, we studied failure prediction from two aspects, node failure prediction and system failure prediction, and then designed a failure-aware resource re-allocation approach. Our main contributions are summarized as follows.

(1) A fuzzy neural network (FNN) based prediction approach for node failures is proposed. For node failures caused by undetected errors, the approach analyzes the monitoring indices that describe the node runtime state, as collected by the monitoring system, and expresses the correlations between these indices and failure occurrence as the weights between neurons in the FNN. The upcoming failure is then predicted by using the fuzzy inference and adaptive-learning mechanisms of the FNN. Simulation results show that the approach not only improves the prediction performance for node failures, but also accelerates convergence, because its parameter initialization indicates the direction of convergence.

(2) A node failure prediction approach based on hidden Markov models (HMMs) and cloud theory (CT) is proposed. For node failures caused by detected errors, the approach extends the HMM and models the progression from the normal state to the failure state as hidden-state transitions in the HMM. In addition, it uses CT algorithms in place of the traditional HMM training algorithms, thus reducing the computation overhead of model training. Simulation results show that the approach provides accurate predictions at low computational cost, striking a balance between prediction performance and computation overhead.

(3) A system failure prediction approach based on association rule mining is proposed. To capture the correlation that failures in a cluster system exhibit, the approach uses historical failure data to model failure correlation patterns as probabilistic shared risk groups (PSRGs). It applies parallel association rule mining to extract frequent itemsets from the failure data and thereby characterize the failure correlation. Simulation results show that the approach performs better in terms of both failure prediction and execution efficiency, and is potentially more suitable for failure prediction in larger-scale cluster systems.

(4) A resource re-allocation approach for cluster failures is proposed. During resource (re-)allocation, an application is divided into multiple executable tasks that communicate with each other and must be assigned to specific cluster nodes before execution. Because failure-prone nodes decrease system throughput and service reliability, the approach takes into account both the performance state and the reliability state of each candidate node when assigning a task to a cluster node. It uses a cooperative coalition game model, adjusts the node resource price dynamically to control the cooperation, and shares resources to accomplish the re-mapping between tasks and cluster nodes. Simulation results show that, with an effective coalition game, the approach obtains a good resource re-allocation solution that optimizes system resource utilization while ensuring service reliability.

In addition, building on the failure prediction work in (1), (2), and (3), the re-allocation approach in (4) uses the prediction results to evaluate the reliability state of the cluster system and to carry out resource re-allocation for cluster failures, thus forming a complete proactive fault-tolerance management model for cluster systems.
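To make the association rule mining idea in contribution (3) concrete, the following is a minimal sketch and not the dissertation's method: it is serial rather than parallel, it only considers pairs of nodes, and it does not build PSRGs. The failure log, the node names, the function name `mine_pair_rules`, and both thresholds are hypothetical, chosen only for illustration. It shows how co-occurring failures in historical data can be turned into rules of the form "if node a fails, node b is likely to fail too", scored by support and confidence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical failure log: each set holds the nodes that failed
# within one observation window.
failure_windows = [
    {"n1", "n2"},
    {"n1", "n2", "n3"},
    {"n2", "n3"},
    {"n1", "n2"},
    {"n4"},
]


def mine_pair_rules(windows, min_support=0.4, min_confidence=0.7):
    """Return rules (antecedent, consequent, support, confidence).

    support    = fraction of windows in which both nodes failed
    confidence = P(consequent fails | antecedent fails)
    """
    n = len(windows)
    item_counts = Counter()
    pair_counts = Counter()
    for window in windows:
        items = sorted(window)  # sorted so each pair has one canonical order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is not frequent enough
        # Try the rule in both directions.
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = count / item_counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return sorted(rules)


rules = mine_pair_rules(failure_windows)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

On this toy log, the pair (n1, n3) is dropped for low support, and the rule n2 -> n3 is dropped for low confidence even though the pair (n2, n3) is frequent. A predictor built on such rules would raise an alert for the consequent node once the antecedent node fails.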

Keywords/Search Tags:

Cluster Systems, Active Fault-Tolerance, Failure Prediction, Resource Re-Allocation

Related items

1	Research On Failure Prediction And Fault-tolerance Technology For Supercomputer
2	Research On Key Technologies Of Failure Prediction Based On Machine Learning Method For Exascale System
3	Failure Tolerance And Prediction For Storage Systems In Datacenters
4	Research On Failure Analysis, Modeling And Prediction For Supercomputers
5	Research On Method For Hard Drive Failure Prediction In Massive Storage System
6	Research On Failure Prediction Of Supercomputers Based On Online Machine Learning
7	Research On Adaption Method Of Cloud Fault Tolerance Services Based On User Requirement And Resource Constriction
8	Research On Container Migration Mechanisms For User Level Fault Tolerance
9	Research On Fast Fault Tolerance Mechanism For Single Point Of Failure In Stream Computing Environment
10	Research On Fault Characterization And Reliability Index Prediction And Allocation Of Industrial Robot