
Research On Self-Healing Regulation Technology For Distributed Mission-Critical Systems

Posted on: 2012-07-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Lu
GTID: 1118330368482908
Subject: Computer application technology
Abstract/Summary:
The complexity, heterogeneity, and dynamic application environments of Distributed Mission-Critical Systems (DMCS) inevitably lead to system failures, suspended missions, interrupted execution, and even system crashes, causing severe economic losses, loss of life, and other serious consequences. DMCS failures also make manual management and operation more difficult. Therefore, autonomic computing technology, whose core goal is self-management, has been studied in various fields. Self-Healing Regulation Technology (SHRT) is one of the critical technologies of autonomic computing. DMCS-oriented self-healing regulation technology provides fundamental functions such as failure prediction, self-healing policy generation, and critical task scheduling, which significantly influence the dependability and sustainability of critical mission execution. Aiming at the dependability and sustainability requirements of critical mission execution, this dissertation systematically studies self-healing regulation technology and its application.

The research starts from a discussion of the overall design principles. First, the critical problems of the overall SHRT design are analyzed and a comprehensive system of evaluation metrics is proposed. Second, the SHRT architecture is proposed and its critical implementation issues are analyzed. For formal validation of the critical task execution flow, the π-calculus is applied to describe the semantics of task execution and switching. The critical task execution logic is then validated, which provides assurance of theoretical feasibility and rationality.

The dynamic generation of self-healing regulation policies is the critical research topic for DMCS-oriented self-healing regulation technology. Based on the self-healing regulation architecture, a policy-based self-healing regulation pattern is proposed.
The basic expression format and logic syntax of policies are discussed. In addition, an approach to simplifying and classifying policies is proposed for dynamic policy management. To address inaccurate failure detection and diagnosis, a self-healing policy regeneration algorithm based on Partially Observable Markov Decision Processes (POMDP) is proposed, and its policy convergence is analyzed theoretically. In the experiments, Los Alamos National Laboratory (LANL) failure data were used to quantify the actual effect of recovery policies, which showed the necessity of self-healing regulation technology. In the simulation experiments, the number of iterations and the convergence speed of policy solving were measured, and the performance of different types of self-healing policies was compared. The results point out directions for generating and optimizing self-healing policies.

Self-healing regulation data analysis and prediction are a necessary condition for DMCS self-healing. Aiming at the high dimensionality and sparsity of nonlinearly correlated failure data from high-performance computer systems, an information-theoretic co-clustering algorithm for nonlinearly correlated failure data is proposed. The co-clustering is evaluated using mutual information, and the convergence and local optimality of the algorithm are proved theoretically. Second, the manifold learning algorithm supervised locally linear embedding (SLLE) is applied for feature extraction. In the experiments, the clustering quality of different methods was first compared on LANL data, and then system performance metrics were collected under fault injection and in the normal state.
Failure prediction performance was then compared. The experimental results on labeled failure data showed that the co-clustering algorithm outperformed the other clustering algorithms and is a rational and effective way to discover nonlinearly correlated failure patterns. The failure analysis and SLLE-based prediction results demonstrated that the method can help predict underlying failures.

Critical task scheduling for self-healing regulation in DMCS is a significant assurance for SHRT design and implementation. Taking the randomness of failures and the continuity of critical task execution into consideration, and to schedule failed tasks rationally, a critical task scheduling method based on Directed Acyclic Graph (DAG) task reconstruction and migration is proposed, following the principle of scheduling first and optimizing afterwards. First, the DAG of correlated tasks is regenerated by the proposed dynamic DAG reconstruction algorithm, which transforms the correlated tasks into a layered DAG. Then the critical task migration route is computed and a deadlock avoidance analysis for migratable tasks is provided. By migrating critical tasks to currently idle resources, task execution time can be reduced markedly. Simulation experiments compared the task speedup of the task migration method and the waiting-recovery method with three kinds of faults injected. The results showed that the task migration method achieves better scheduling quality under flexible load and unknown fault injection.
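
The abstract does not spell out the dynamic DAG reconstruction algorithm, so the following is only a minimal sketch of the general idea of turning correlated tasks into a layered DAG. It assumes a standard longest-path layering built on Kahn's topological sort, which is not necessarily the dissertation's method. The function name layer_dag and the task names are hypothetical.

```python
from collections import defaultdict, deque


def layer_dag(tasks, edges):
    """Group tasks into layers so that each task lies below all its predecessors.

    Illustrative assumption: plain longest-path layering, not the
    dissertation's own reconstruction algorithm.
    """
    successors = defaultdict(list)
    indegree = {t: 0 for t in tasks}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    level = {t: 0 for t in tasks}
    ready = deque(t for t in tasks if indegree[t] == 0)
    visited = 0
    while ready:
        task = ready.popleft()
        visited += 1
        for nxt in successors[task]:
            # A task's layer is one more than its deepest predecessor.
            level[nxt] = max(level[nxt], level[task] + 1)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Tasks left unvisited are stuck behind a dependency cycle.
    if visited != len(tasks):
        raise ValueError("task graph contains a cycle")

    layers = [[] for _ in range(max(level.values(), default=-1) + 1)]
    for t in tasks:
        layers[level[t]].append(t)
    return layers


# Hypothetical diamond-shaped task graph: a feeds b and c, which both feed d.
layers = layer_dag(["a", "b", "c", "d"],
                   [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
```

Tasks within one layer have no dependencies on each other, so under this sketch they would be natural candidates for migration to idle resources in parallel. How the dissertation actually selects migration routes is not stated in the abstract.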
Keywords/Search Tags: Autonomic Computing, Self-Healing Regulation, Self-Healing Strategy Generation, Failure Prediction, DAG Task Migration