Research On Recovery-Oriented Fault-Tolerant Computing Technique

Posted on:2008-12-27

Degree:Doctor

Type:Dissertation

Country:China

Candidate:H S Li

GTID:1118360272979906

Subject:Computer application technology

Abstract/Summary:

Cost, feasibility and scalability are the main issues that hinder the development and limit the wide deployment of highly reliable fault-tolerant computers. Building fault-tolerant computers from the hardware and software resources of commercial off-the-shelf (COTS) computers has advantages over traditional proprietary fault-tolerance design techniques and is expected to become a research focus. Two important and difficult problems arise, however, when such computers are developed with COTS techniques through coordinated hardware and software: making fault-tolerant computing transparent to end users, and reducing the impact on the normal functioning of the system while increasing the efficiency of fault detection, diagnosis and recovery. Moreover, the drawbacks of the widely used rollback-based fault recovery mechanism must be overcome.

After an in-depth investigation of the working principles of several domestic and foreign fault-tolerant computer systems and of clustering techniques, the author proposes a structural model for a COTS-based fault-tolerant cluster server, together with a fault recovery model and scheme, to meet the requirements for high reliability, high availability and high performance that some fields place on servers. The OPIAC fault-tolerant cluster server, based on Linux and the PC platform, is designed and implemented, combining fault-tolerance techniques with the advantages of clusters. The different fault scenarios in the server are analyzed, with emphasis on the fault recovery mechanism and scheme, the implementation of checkpointing, and the performance evaluation of this kind of server.

First, after analyzing the architecture of several typical fault-tolerant computer systems, the author studies the fault recovery mechanisms and methods, and the ways of implementing checkpointing, used in current fault-tolerance designs. On this basis, the author identifies the key factors that affect the cost of checkpointing, the trends in checkpointing techniques, the difficulties of designing intelligent fault-tolerant computing based on COTS components and checkpointing, and the main technical problems that remain to be solved.

Second, the author investigates the different fault occurrences and recovery situations in a triple modular redundancy (TMR) structure, both with and without backup modules, and quantitatively analyzes and compares the efficiency of the fault recovery algorithms used in the two structures. After studying the impact of the checkpoint interval, i.e., the time between two consecutive saves of process context, on normal system execution, the author puts forward a dynamic checkpoint placement strategy that saves process state dynamically to meet the requirements of real-time applications. To further improve the efficiency of fault recovery, a transparent, parallel fault recovery algorithm for intelligent fault-tolerant systems, the ladder algorithm, is also proposed.

Third, the author suggests that a highly reliable, high-performance fault-tolerant server be built by combining fault-tolerance and clustering techniques on a COTS basis. Accordingly, the OPIAC fault-tolerant cluster server, which offers high reliability, high availability and high performance, has been designed and implemented on the Linux and PC platform. By modifying and extending the Linux kernel, a fault-tolerance management module with autonomous processing capability is used to make the fault-tolerance function transparent to applications. On the one hand, no restrictions or additional requirements are placed on the coding or execution of applications that run on the OPIAC fault-tolerant cluster server. On the other hand, client applications communicating with the server do not perceive the fault detection, diagnosis and recovery inside the server or the migration of service processes among its internal nodes, nor do these activities affect the establishment of new network connections. The measures taken to reduce fault recovery time and to raise execution efficiency during fault recovery are described in detail. In the design of the I/O subsystem, a virtual device driver layer, a device resource management layer and a kernel service simulation layer are devised and implemented. With these layers and a log-based fault recovery algorithm, the shortcomings of traditional checkpoint-based I/O recovery can be overcome.

Finally, the saving and restoring of process context under the Linux operating system is described in detail.
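
The abstract notes that the checkpoint interval affects normal system execution. The sketch below illustrates that trade-off with a standard first-order model (Young's approximation). It is not the dynamic strategy proposed in the dissertation, whose details are not given here. The function names, the overhead model and the numeric values are assumptions chosen for illustration only.

```python
import math


def overhead_fraction(interval, checkpoint_cost, mtbf):
    """First-order estimate of the fraction of run time lost to fault tolerance.

    checkpoint_cost / interval : time spent saving process context
    interval / (2 * mtbf)      : expected rework after a fault, assuming a
                                 fault lands on average halfway through
                                 an interval
    All arguments share one time unit (for example, seconds).
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def young_interval(checkpoint_cost, mtbf):
    """Interval that minimizes overhead_fraction (Young's approximation)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    # Hypothetical values: 2 s per checkpoint, 100 s mean time between faults.
    cost, mtbf = 2.0, 100.0
    best = young_interval(cost, mtbf)
    for t in (best / 2, best, best * 2):
        lost = overhead_fraction(t, cost, mtbf)
        print(f"interval={t:6.1f}s  overhead={lost:.3f}")
```

Under this model, checkpointing too often wastes time saving context, while checkpointing too rarely increases the work that must be redone after a fault. A dynamic strategy could, for instance, recompute the interval whenever its estimate of the fault rate or checkpoint cost changes.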

Keywords/Search Tags:

Fault-tolerant cluster server, Fault recovery, Forward recovery, Transparency, Dynamic checkpointing strategy

Related items

1	Fault-Tolerant Of MPI Programs Based On Rollback Recovery
2	Distributed File System Level Fault-tolerant Mechanism
3	Research On Fault Recovery Techniques For Soft Errors Of COTS DSP
4	Modeling Of Fault Diagnosis And Recovery Function Of Fault-tolerant System
5	Research On Checkpointing And Rollback Recovery Fault-tolerant Techniques For Mobile Computing Environment
6	Parallel Computing In The Host Fault Tolerant Mechanism Studies
7	Recovery in fault-tolerant distributed microcontrollers
8	Research On Non-Blocking Coordinated Checkpointing Algorithm In The Mobile Computing Environment
9	Research On Incremental Checkpointing And Rollback Recovery
10	Cluster Oriented Fault Tolerance For MPI Parallel Applications