Improving Availability With Fine-grained Failure Detection And Recovery

Posted on: 2007-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Jin
GTID: 2178360182493710
Subject: Computer applications
Abstract/Summary:
In the era of information technology, high availability is increasingly important for providing round-the-clock, non-stop service. Failure detection and recovery are central to achieving higher availability, so a variety of failure detection and recovery techniques have been studied in theory and deployed in high-availability systems. However, little work has examined how to combine and integrate these existing schemes into a comprehensive and adaptive failure detection and recovery approach.

This thesis proposes a fine-grained failure detection and recovery approach that combines various failure detection and recovery techniques in a hierarchical and systematic way. The approach comprises four levels, ranging from the inner-process level to the box level. At each level, detection and recovery techniques are introduced to detect and recover from the failures specific to that level. To reduce recovery time, the partial restart technique is used at the inner-process level. Moreover, the approach is designed to be dynamically reconfigurable so that it can adapt to the requirements of an ever-changing environment.

The fine-grained failure detection and recovery approach is applied to the 24-hour FB project, which aims to consolidate the messaging infrastructure of an accounting system. The original failure detection and recovery mechanism is extended with the four-level approach, enabling it to detect inner-process errors and to respond much faster to various failures. In addition, a unified configuration interface is added to the system's control GUI to facilitate configuration and management.

The implementation of fine-grained failure detection and recovery in the 24-hour FB system is evaluated through experiments. The evaluation results give further insight into the performance overhead, recovery time, and availability of the proposed approach.
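
As a rough illustration of the ideas in the abstract, the following minimal sketch shows a heartbeat-based detector with one timeout per level and a recovery callback, where timeouts can be changed at run time. It is not the thesis's implementation. The class and method names (`Level`, `HeartbeatMonitor`, `register`, `beat`, `reconfigure`, `check`) are hypothetical, the two middle levels are placeholders because the abstract names only the inner-process and box levels, and the timeout values in the usage note are invented for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, Dict, List


class Level(IntEnum):
    """Four detection levels; only the first and last are named in the abstract."""
    INNER_PROCESS = 1
    LEVEL_2 = 2  # placeholder: not named in the abstract
    LEVEL_3 = 3  # placeholder: not named in the abstract
    BOX = 4


@dataclass
class HeartbeatMonitor:
    """Reports a component as failed when its last heartbeat is older than
    the timeout configured for the component's level."""
    timeouts: Dict[Level, float]            # seconds per level (assumed unit)
    recover: Callable[[str, Level], None]   # recovery action supplied by the caller
    last_seen: Dict[str, float] = field(default_factory=dict)
    levels: Dict[str, Level] = field(default_factory=dict)

    def register(self, name: str, level: Level, now: float) -> None:
        """Start monitoring a component at the given level."""
        self.levels[name] = level
        self.last_seen[name] = now

    def beat(self, name: str, now: float) -> None:
        """Record a heartbeat from a component."""
        self.last_seen[name] = now

    def reconfigure(self, level: Level, timeout: float) -> None:
        """Change a level's timeout while the monitor is running."""
        self.timeouts[level] = timeout

    def check(self, now: float) -> List[str]:
        """Return the components whose heartbeat has timed out and invoke
        the recovery callback for each of them only."""
        failed = []
        for name, seen in list(self.last_seen.items()):
            level = self.levels[name]
            if now - seen > self.timeouts[level]:
                failed.append(name)
                self.recover(name, level)
                # Assume recovery restarted the component; reset its timer.
                self.last_seen[name] = now
        return failed
```

A caller would build a monitor with, say, a 0.5-second timeout for `Level.INNER_PROCESS` and a 10-second timeout for `Level.BOX`, call `beat` whenever a component reports in, and call `check` periodically. Because `check` recovers only the components that timed out, an inner-process failure triggers a restart of that component alone, which is the intuition behind partial restart. Passing the time in as `now` rather than reading a clock keeps the sketch deterministic.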
Keywords/Search Tags: Failure Detection, Failure Recovery, High Availability, Partial Restart, Heartbeat