Research On Rollback Recovery Fault-Tolerance Technology In High Availability Cluster

Posted on:2007-07-04

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J G Wang

Full Text:PDF

GTID:1118360215459706

Subject:Computer application technology

Abstract/Summary:

With the rapid development of computing technology, cluster systems have been widely adopted. With numerous applications running on cluster platforms, high availability has become highly desirable. A high availability cluster can provide highly reliable, integrated services for computing tasks in the presence of hardware and software faults. Furthermore, many applications require a fault-tolerance scheme with low overhead in terms of hardware and software resources. Rollback recovery meets this demand well, offering an attractive low-overhead approach to building fault-tolerant applications. However, applying rollback recovery to a cluster raises several challenges, such as checkpoint placement for real-time tasks, fast failover, and a reliable cluster communication protocol. This dissertation aims to solve the key problems that rollback recovery faces in high availability clusters. The following original research is carried out.

The dissertation systematically introduces the basic principles, models, algorithms, and recent research results of rollback recovery. The features and application areas of the various protocols are also analyzed in depth.

In a real-time cluster system, each task must complete and produce correct output by its specified deadline. However, system faults can make it impossible to meet every deadline. It is therefore important to reduce the precision of real-time tasks where necessary and to provide optimal fault-tolerant scheduling that ensures timeliness and reliability. The dissertation analyzes the system failure distribution and provides an optimized checkpoint placement algorithm based on imprecise computation (IC-CPS). This algorithm can provide fault tolerance and real-time guarantees for multi-task real-time systems. Numerical examples show that IC-CPS can improve the fault-tolerance performance of the system: real-time tasks can tolerate more faults while still completing on time.

Rollback recovery can be used to increase the service availability of network applications. One key aspect of rollback recovery is failover: the reconfiguration of available resources and the restoration of state required to continue providing the service despite the loss of some resources and the corruption of parts of the state. Most failover schemes used to increase the availability of network services do not actually guarantee service availability. Other schemes require deterministic servers or changes to the client. The dissertation proposes a service-oriented fast transparent failover model (SOFailover) and an algorithm for providing fault-tolerant network services that do not have the limitations mentioned above, and proves the service failover constraint of the algorithm. Experimental results indicate that SOFailover has short failover times and low overhead during fault-free operation. Nondeterministic systems can fail over quickly and transparently while service availability is preserved.

Rollback recovery makes communication in a cluster more difficult. The dissertation studies in depth the problems and limitations of the main existing data transfer protocols, proposes a reliable and efficient data transfer protocol based on UDP (REUDP), and provides an analytical model to predict REUDP's performance. REUDP seeks to improve cluster communication capability while guaranteeing reliability. Experimental results show that REUDP performs extremely efficiently over high-speed cluster networks and that the analytical model provides good estimates of its performance.

As high performance clusters continue to grow in size, the mean time between failures shrinks. Fault tolerance and reliability are thus becoming major challenges for application scalability. Traditional fault tolerance methods have many limitations and do not meet the fault-tolerance demands of large-scale heterogeneous cluster systems. The dissertation presents a rollback recovery scheme for large-scale clusters that aims at low overhead on the forward path and fast recovery from faults, without wasting computation done by processors that have not faulted. The scheme does not require any individual component to be fault-free. Moreover, the scheme can effectively reduce the impact of losing computing nodes (load imbalance due to a crash) through automatic load balancing at run time.

Predominant industrial practice has evolved from general-purpose class libraries to domain-specific frameworks and design patterns. Based on the characteristics of high availability system software and the impact of reuse technology on software development, the dissertation proposes a reusable rollback recovery framework based on a pattern language (RRAF). This framework can incorporate the algorithms and schemes above. The dissertation discusses in detail the system architecture and operating principle of the framework, the construction of the pattern language, and the collaboration relations among patterns. The application of the framework in streaming media server software is also discussed. This work is a useful exploration of applying pattern languages to domain-specific, large-scale software reuse.
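
The basic checkpoint-and-rollback idea that the abstract builds on can be sketched as follows. This is a minimal illustration only, not the dissertation's IC-CPS algorithm, SOFailover model, or RRAF framework. The class `CheckpointedTask`, its fixed checkpoint interval, and the injected-fault mechanism are all hypothetical and chosen for brevity. A fault is modelled as the loss of all progress made since the last checkpoint.

```python
import copy


class CheckpointedTask:
    """Hypothetical task that sums a list of numbers, one item per step."""

    def __init__(self, items, checkpoint_every):
        self.items = list(items)
        self.checkpoint_every = checkpoint_every  # fixed interval (an assumption)
        self.state = {"next_index": 0, "total": 0}
        # In-memory stand-in for stable storage holding the last checkpoint.
        self.saved = copy.deepcopy(self.state)
        self.rollbacks = 0
        self.steps_executed = 0

    def checkpoint(self):
        # Save a full copy of the current state.
        self.saved = copy.deepcopy(self.state)

    def rollback(self):
        # Discard progress since the last checkpoint and restore it.
        self.state = copy.deepcopy(self.saved)
        self.rollbacks += 1

    def run(self, fail_at_steps=()):
        # Each listed index triggers one injected fault just before
        # that item would be processed.
        pending_faults = set(fail_at_steps)
        while self.state["next_index"] < len(self.items):
            i = self.state["next_index"]
            if i in pending_faults:
                pending_faults.discard(i)
                self.rollback()
                continue
            self.state["total"] += self.items[i]
            self.state["next_index"] = i + 1
            self.steps_executed += 1
            if self.state["next_index"] % self.checkpoint_every == 0:
                self.checkpoint()
        return self.state["total"]


task = CheckpointedTask(range(1, 11), checkpoint_every=3)
result = task.run(fail_at_steps={5})
print(result, task.rollbacks, task.steps_executed)
```

In this sketch a shorter checkpoint interval means less re-executed work after a fault but more checkpoints during fault-free operation. That trade-off is the general motivation for studying checkpoint placement.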

Keywords/Search Tags:

Software Fault-tolerance, Cluster System, Rollback Recovery, Failover, Checkpoint

Related items

1	The Research And Implementation Of Checkpoint Technology Based On WinNT
2	The Research On Low-overhead Rollback Recovery Fault-Tolerance Technology
3	Fault-Tolerant Of MPI Programs Based On Rollback Recovery
4	Research On Incremental Checkpointing And Rollback Recovery
5	Dynamic Cluster Strategy For Hierarchical Rollback-Recovery Protocols
6	The Design And Research Of Process Level Fault-tolerance Based On Checkpoint
7	Study On Backward Recovery Of Fault Tolerant Technology In Distributed Systems
8	Cluster Oriented Fault Tolerance For MPI Parallel Applications
9	Research On Low Overhead Non-blocking Checkpointing Scheme For Mobile Computing System
10	Research On Checkpointing And Rollback Recovery Fault-tolerant Techniques For Mobile Computing Environment