Fault-tolerant Computing Based On Statistical Learning Technology

Posted on:2008-09-26

Degree:Master

Type:Thesis

Country:China

Candidate:Q J Zhu

Full Text:PDF

GTID:2178360245991819

Subject:Computer application technology

Abstract/Summary:

The increasing complexity of modern computer systems makes fault detection and localization prohibitively expensive, so fast recovery from failures is becoming increasingly important. A significant fraction of failures can be cured by executing specific repair actions, e.g. a reboot, even when the exact root causes are unknown. However, designing reasonable recovery policies that schedule the candidate repair actions effectively is difficult and error-prone. We therefore seek an automatic way to generate high-performance recovery policies.

In this paper, we present a novel approach that automates recovery policy generation with Reinforcement Learning techniques. We first formalize the automatic error recovery problem and prove the effectiveness of our reinforcement learning method in theory. Then, based on the recovery history of the original user-defined policy, we verify two different implementations of the method through experiments.

The first is the direct online learning approach, which applies the learning process directly in the running system. Experimental analysis shows that the learning method achieves better performance, scales well, and converges to the globally optimal policy. The method also remains effective when a special error pattern is introduced.

The second is the offline learning approach, which applies the learning process to the error recovery history to generate recovery policies. Because the history is shaped by the original user-defined policy, the recovery policy generated this way is only locally optimal, but it still outperforms the original one. In our experiment on data from a real cluster environment, the automatically generated policy reduced machine downtime by 10%.

Moreover, to avoid situations in which the learned policy cannot handle some error cases, we propose a hybrid method that combines the advantages of both: it maintains the high performance of the learned policy while handling all possible cases, as the user-defined policy does.
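
To make the idea of learning a recovery policy concrete, the following is a minimal sketch of tabular Q-learning (one of the listed keywords) applied to repair-action selection. It is not the thesis's implementation: the error types, repair actions, downtime costs, cure probabilities, the three-attempt limit and the manual-repair penalty are all hypothetical values chosen for illustration, and a small simulator stands in for the running system.

```python
import random
from collections import defaultdict

# Hypothetical repair actions and their assumed downtime cost (minutes).
ACTIONS = {"restart_service": 5.0, "reboot": 10.0, "reimage": 60.0}

# Hypothetical probability that an action cures a given error type.
CURE_PROB = {
    "service_crash": {"restart_service": 0.9, "reboot": 0.95, "reimage": 1.0},
    "corrupt_os": {"restart_service": 0.0, "reboot": 0.0, "reimage": 1.0},
}

MAX_STEPS = 3        # assumed limit on automatic repair attempts
MANUAL_COST = 120.0  # assumed extra downtime if all attempts fail


def learn_policy(episodes=20000, alpha=0.05, gamma=1.0, epsilon=0.1, seed=0):
    """Learn Q-values; a state is (error type, number of failed attempts)."""
    rng = random.Random(seed)
    actions = list(ACTIONS)
    errors = list(CURE_PROB)
    q = defaultdict(float)  # (state, action) -> estimated negative downtime
    for _ in range(episodes):
        error = rng.choice(errors)
        for step in range(MAX_STEPS):
            state = (error, step)
            # Epsilon-greedy choice between exploring and exploiting.
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            cured = rng.random() < CURE_PROB[error][action]
            reward = -ACTIONS[action]  # downtime is a negative reward
            if cured:
                target = reward
            elif step == MAX_STEPS - 1:
                target = reward - MANUAL_COST  # fall back to manual repair
            else:
                next_state = (error, step + 1)
                best_next = max(q[(next_state, a)] for a in actions)
                target = reward + gamma * best_next
            # Standard Q-learning update toward the bootstrapped target.
            q[(state, action)] += alpha * (target - q[(state, action)])
            if cured:
                break
    return q


def greedy_action(q, error, step=0):
    """Return the repair action with the highest learned value."""
    return max(ACTIONS, key=lambda a: q[((error, step), a)])


if __name__ == "__main__":
    q = learn_policy()
    for error in CURE_PROB:
        print(error, "->", greedy_action(q, error))
```

Under these assumed numbers, the learned policy would be expected to try the cheap `restart_service` first for `service_crash` and to go straight to `reimage` for `corrupt_os`. An offline variant in the spirit of the abstract would replay logged (state, action, outcome) records instead of calling a simulator, which is why it can only improve on actions the original policy actually tried.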

Keywords/Search Tags:

Fault-tolerant Computing, Statistical Learning Theory, Q-learning, Reinforcement Learning, Automatic Error Recovery

Related items

1	Reinforcement Learning Based On Spectral Graph Theory
2	Research On Error Bound Theory And The Statistical Feature Of Machine Learning Algorithms
3	Research On Reinforcement Learning Based Control Method Of Magnetic Navigation AGV
4	Kernel Learning Algorithms And Ensemble Methods
5	Research On Key Technology Of Fault-Tolerant Nanoscale Circuit Based On Statistical Model
6	Supervised Reinforcement Learning: Methods And Applications
7	Research And Implementation Of Reinforcement Learning Method About Transport Strategy Between Carrier-based Aircraft Station
8	Research On Recovery-Oriented Fault-Tolerant Computing Technique
9	The Research And Application Of Multiple Kernel Prediction Model Based On Statistical Learning Theory
10	Research On Laser Navigation AGV Control Method Based On Reinforcement Learning