Fault-tolerant Computing Based On Statistical Learning Technology

Posted on: 2008-09-26
Degree: Master
Type: Thesis
Country: China
Candidate: Q J Zhu
Full Text: PDF
GTID: 2178360245991819
Subject: Computer application technology
Abstract/Summary:
The increasing complexity of modern computer systems makes fault detection and localization prohibitively expensive, so fast recovery from failures is becoming increasingly important. A significant fraction of failures can be cured by executing specific repair actions, e.g. a reboot, even when the exact root causes are unknown. However, designing reasonable recovery policies that schedule the available repair actions effectively can be difficult and error-prone. We therefore seek an automatic way to generate high-performance recovery policies.
In this paper, we present a novel approach that automates recovery policy generation with reinforcement learning techniques. We first formalize the automatic error recovery problem and prove the effectiveness of our reinforcement learning method in theory. Then, based on the recovery history of the original user-defined policy, we verify two different implementations of the method through experiments.
The first is the direct online learning approach, which applies the learning process directly in the running system. Experimental analysis shows that this approach achieves better performance, is scalable, and converges to the globally optimal policy. We further verify its effectiveness after introducing a special error pattern.
The second is the offline learning approach, which applies the learning process to the error recovery history to generate recovery policies. Because it is constrained by the original user-defined policy, the generated recovery policy is only locally optimal, but it still outperforms the original one. In our experiment on data from a real cluster environment, the automatically generated policy reduced machine downtime by 10%.
Moreover, because the learned policy may not cover some error cases, we propose a hybrid method that combines the advantages of both policies: it maintains the high performance of the learned policy while still handling all possible cases, as the user-defined policy does.
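The offline and hybrid ideas described in the abstract can be illustrated with a minimal tabular Q-learning sketch. This is not the thesis's implementation: the state encoding (error type plus the actions already tried), the action names, the downtime values, the log format, and the helper names `user_policy`, `learn_offline`, and `hybrid_policy` are all assumptions made for illustration.

```python
# Hypothetical repair actions, ordered from cheapest to most expensive.
ACTIONS = ["retry", "reboot", "reimage"]


def user_policy(error_type, tried):
    """Assumed hand-written policy: escalate one step per failed attempt."""
    return ACTIONS[min(len(tried), len(ACTIONS) - 1)]


def learn_offline(log, alpha=0.5, gamma=1.0, sweeps=50):
    """Tabular Q-learning replayed over logged recovery episodes.

    Each episode is (error_type, [(action, downtime, fixed), ...]).
    The reward of a step is the negative downtime it caused.
    """
    q = {}  # state -> {action: estimated value}
    for _ in range(sweeps):
        for error_type, steps in log:
            tried = ()
            for action, downtime, fixed in steps:
                state = (error_type, tried)
                next_state = (error_type, tried + (action,))
                # No future cost once the error is repaired.
                future = 0.0 if fixed else max(
                    q.get(next_state, {}).values(), default=0.0)
                target = -downtime + gamma * future
                values = q.setdefault(state, {})
                old = values.get(action, 0.0)
                values[action] = old + alpha * (target - old)
                tried += (action,)
    return q


def hybrid_policy(q, error_type, tried):
    """Use the learned policy where it has data, else the user-defined one."""
    known = q.get((error_type, tuple(tried)))
    if known:
        return max(known, key=known.get)
    return user_policy(error_type, tried)


# Toy recovery history (invented data, not from the thesis).
log = [
    ("disk_error", [("retry", 1.0, False), ("reboot", 5.0, True)]),
    ("disk_error", [("reboot", 5.0, True)]),
]
q = learn_offline(log)
print(hybrid_policy(q, "disk_error", ()))  # state covered by the log
print(hybrid_policy(q, "net_error", ()))   # unseen state, falls back
```

On this toy log the learned values favor rebooting immediately for `disk_error`, since a retry only adds downtime before the reboot, while the unseen `net_error` case is delegated to the hand-written escalation policy. Because the table only contains states and actions that appear in the log, the result is limited by what the original policy tried, which matches the abstract's remark that the offline policy is locally optimal.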
Keywords/Search Tags:Fault-tolerant Computing, Statistical Learning Theory, Q-learning, Reinforcement Learning, Automatic Error Recovery