
Research On Autonomic Recovery Mechanisms For Distributed Mission-Critical System

Posted on: 2011-12-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Z Ye
Full Text: PDF
GTID: 1488303308455084
Subject: Computer application technology
Abstract/Summary:
With the development of information technology and changing requirements, the availability demands on distributed mission-critical systems are growing ever higher: not only must the integrity of mission-critical data be guaranteed, but the system must also run without interruption or recover automatically within a short time when failures happen. However, with increasing system scale and complexity, and with inevitable faults such as malicious attacks and software bugs, system failures occur frequently, and tracking, analyzing, and recovering from them has become extremely difficult. Such systems therefore urgently need the abilities of self-monitoring, self-diagnosis, intelligent decision-making, and self-recovery across different failure scenarios. Autonomic computing offers a new approach to this problem. By combining autonomic computing with detection technologies, recovery technologies, and decision-making methods, a distributed mission-critical system can recover from failures automatically and so maintain high availability. However, autonomic computing is still in its infancy, and its application to failure recovery in distributed mission-critical systems is lacking. Many basic issues, such as how to build the autonomic recovery architecture and how to implement autonomic failure detection, decision-making, and recovery, still need careful study. The autonomic recovery mechanism was therefore studied in depth in order to improve the self-recovery abilities of such systems.

Firstly, to meet the specific requirements of such systems, the architecture DARA (DMCS Autonomic Recovery Architecture) was proposed based on the autonomic computing concept. In this architecture, autonomic recovery is divided into a knowledge level, a management level, and a target level, and the functionality of each level was analyzed.
From the architectural perspective, a failure "detection, decision-making, recovery" control loop was formed, which lowers the complexity of autonomic recovery. Then, a system recovery management knowledge database, including a system entity dependency model, a state model, and management strategies, was built to support failure detection and recovery. The architecture was formalized and validated using the π-calculus to demonstrate its soundness.

Secondly, failure detection for distributed mission-critical systems was studied in terms of the detection method and the message transfer mechanism. On the detection side, to meet the high accuracy required for failure detection in the runtime environment, the A-Hybrid detection method was proposed. This method detects and locates failed objects by applying an application configuration model, a server model, and a host model. On the message transfer side, to satisfy the requirement for loosely coupled messaging, a publish/subscribe mechanism for transferring detection messages was proposed. Experimental results showed that, compared with other detection methods, the A-Hybrid method accurately detects failures and identifies the specific failed objects.

Thirdly, a decision-making method for autonomic recovery was studied for both application components and the runtime environment. For application components, to address the low decision-making efficiency that arises when failures are strongly correlated, a recovery decision-making method based on reboot tree optimization was proposed. To optimize the reboot tree, components with high failure correlation were merged into a single reboot group by computing their degree of failure correlation, and a recovery plan was then made from the reboot tree and the detection results. Examples showed that, compared with the method without reboot tree optimization, this method achieves higher efficiency and shorter recovery time.
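The grouping step in reboot tree optimization can be illustrated with a minimal sketch. The abstract does not define how the degree of failure correlation is computed, so the code below assumes a simple stand-in: the Jaccard ratio of past failure incidents in which two components failed together, with a hypothetical threshold of 0.5. The function names `failure_correlation`, `reboot_groups`, and `plan_reboot` are illustrative and are not taken from the dissertation.

```python
from itertools import combinations


def failure_correlation(history, a, b):
    """Assumed metric: Jaccard ratio of the incidents in which a and b failed."""
    with_a = {i for i, failed in enumerate(history) if a in failed}
    with_b = {i for i, failed in enumerate(history) if b in failed}
    union = with_a | with_b
    return len(with_a & with_b) / len(union) if union else 0.0


def reboot_groups(components, history, threshold=0.5):
    """Merge components whose pairwise correlation reaches the threshold."""
    parent = {c: c for c in components}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in combinations(components, 2):
        if failure_correlation(history, a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for c in components:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())


def plan_reboot(groups, failed):
    """Select every reboot group that contains a detected failed component."""
    return [g for g in groups if any(c in failed for c in g)]


# Hypothetical failure history: each entry is the set of components that failed together.
components = ["auth", "cart", "db", "ui"]
history = [{"auth", "db"}, {"auth", "db"}, {"cart"}, {"ui", "cart"}, {"auth", "db", "ui"}]
groups = reboot_groups(components, history)
plan = plan_reboot(groups, failed={"db"})
```

In this example `auth` and `db` always fail together, so they end up in one reboot group, and a detected failure of `db` selects that whole group for reboot in a single decision instead of two separate ones.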
At the same time, considering the diversity of failure scenarios in the runtime environment, a decision-making method based on AI planning was put forward. The planning domain was described in terms of the dependencies among objects in the runtime environment, the initial state and goal state were determined from the detection results and the target policy, and the recovery plan was then generated by a planner. Experimental results showed that the decision-making method based on AI planning can effectively generate suitable recovery plans.

Finally, the implementation of autonomic recovery for mission-critical systems was studied from two aspects: application components and the runtime environment. For application components, a multi-granularity microreboot method was proposed for recovery from transient failures, in which reboot objects are clustered into different microreboot elements in order to achieve high availability. Experimental results showed that this method requires 48% less reboot time than the traditional reboot method, which helps to achieve high availability. For the runtime environment, a script-based recovery method was put forward that focuses on the relationship between recovery plans and scripts. In addition, the script generation time under different degrees of failure in the runtime environment was studied, so that flexible recovery plans can be provided according to the requirements of different application environments.
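The planning-based decision step described above (dependencies as the domain, detection results as the initial state, target policy as the goal) can be sketched as a small state-space search. This is only an illustration under stated assumptions: the abstract does not name the planner or the domain language, so the sketch uses plain breadth-first search, a single hypothetical `restart` action per object, and an invented three-object dependency chain.

```python
from collections import deque


def make_actions(depends_on):
    """One hypothetical 'restart' action per object; it needs its dependencies up."""
    return {
        f"restart {obj}": (frozenset(deps), obj)
        for obj, deps in depends_on.items()
    }


def plan_recovery(depends_on, up_now, goal_up):
    """Breadth-first search from the detected state to a state covering goal_up."""
    actions = make_actions(depends_on)
    start = frozenset(up_now)
    goal = frozenset(goal_up)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if goal <= state:
            return plan
        for name in sorted(actions):
            preconditions, target = actions[name]
            # Skip objects that are already up or whose dependencies are still down.
            if target in state or not preconditions <= state:
                continue
            nxt = state | {target}
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [name]))
    return None  # No sequence of restarts reaches the goal.


# Assumed runtime environment: the application depends on the server, the server on the host.
depends_on = {"host": [], "server": ["host"], "app": ["server"]}
plan = plan_recovery(depends_on, up_now={"host"}, goal_up={"host", "server", "app"})
```

Here the detected state has only the host running, so the search returns the restarts in dependency order. A recovery plan of this kind could then be mapped to scripts, one per action, which is where the script-based recovery method would take over.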
Keywords/Search Tags: Distributed Mission-Critical System, Autonomic Computing, Detection, Decision-making, Recovery
Related items