Research On Recovery-Oriented Fault-Tolerant Computing Technique

Posted on: 2008-12-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H S Li
Full Text: PDF
GTID: 1118360272979906
Subject: Computer application technology
Abstract/Summary:
Cost, realizability, and scalability are the main issues that hinder the development and limit the wide deployment of highly reliable fault-tolerant computers. Building fault-tolerant computers from the hardware and software resources of commercial-off-the-shelf (COTS) computers has advantages over traditional proprietary fault-tolerance design techniques and is expected to become a research hotspot. However, developing COTS-based fault-tolerant computers through coordinated hardware and software raises two important and difficult problems: making fault-tolerant computing transparent to end users, and reducing the impact on normal system operation while increasing the efficiency of fault detection, diagnosis, and recovery. Moreover, the drawbacks of the widely used rollback-based fault recovery mechanism must be overcome.

After an in-depth investigation of the working principles of several domestic and overseas fault-tolerant computer systems and clustering techniques, the author proposes a structural model of a COTS-based fault-tolerant cluster server, together with a fault recovery model and scheme, to meet the requirements for high reliability, high availability, and high performance placed on servers in some fields. The OPIAC fault-tolerant cluster server, based on Linux and the PC platform, is designed and implemented, combining fault-tolerance techniques with the advantages of clustering. The different fault scenarios in the server are analyzed, with emphasis on the fault recovery mechanism and scheme, the implementation of checkpointing, and the performance evaluation of this kind of server.

Firstly, after analyzing the architecture of several typical fault-tolerant computer systems, the author studies the fault recovery mechanisms, methods, and ways of implementing checkpointing in current fault-tolerance designs. As a result, the author identifies the key factors affecting the cost of checkpointing, the trend of checkpointing techniques, the difficulties encountered when designing intelligent fault-tolerant computing based on COTS components and checkpointing, and the primary technical problems that must be solved.

Secondly, the author investigates the different fault occurrence and recovery situations in TMR (triple modular redundancy) structures both with and without backup modules, and quantitatively analyzes and compares the efficiency of the fault recovery algorithms used in the two structures. After studying the impact of the checkpoint interval, i.e., the time between two consecutive saves of the process context, on normal system execution, the author puts forward a dynamic checkpoint placement strategy that saves the process state dynamically to meet the requirements of real-time applications. To further improve the efficiency of fault recovery, a transparent and parallel fault recovery algorithm for intelligent fault-tolerant systems, the ladder algorithm, is also proposed.

Thirdly, the author suggests that a highly reliable, high-performance fault-tolerant server be constructed by combining fault-tolerance and clustering techniques on the basis of COTS techniques. Accordingly, the OPIAC fault-tolerant cluster server, offering high reliability, high availability, and high performance, has been designed and implemented on the Linux and PC platform. By modifying and extending the Linux kernel, a fault-tolerance management module with autonomous processing capability is adopted to make the fault-tolerance function transparent to applications. On the one hand, no limitations or additional requirements are imposed on the coding or running of applications on the OPIAC fault-tolerant cluster server. On the other hand, client applications communicating with the server do not perceive the fault detection, diagnosis, and recovery inside the server or the migration of service processes among its internal nodes, nor do these activities affect the establishment of new network connections. The measures taken to reduce fault recovery time and to raise execution efficiency during fault recovery are described in detail. In the design of the I/O subsystem, a virtual device driver layer, a device resource management layer, and a kernel service simulation layer are devised and implemented. With these layers and a log-based fault recovery algorithm, the shortcomings of traditional checkpoint-based I/O recovery can be overcome.

Finally, the saving and restoration of process context in the Linux operating system are described in detail.
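
The checkpoint-interval trade-off mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the dissertation's own dynamic strategy, so the code below is not the author's method. It uses Young's well-known first-order approximation, in which the interval that minimizes overhead is sqrt(2 × checkpoint cost × mean time between failures). The function names and the sample numbers are assumptions chosen for illustration only.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    NOTE: a textbook stand-in, not the strategy from the dissertation.
    checkpoint_cost_s: time to save one process context (assumed known).
    mtbf_s: estimated mean time between failures.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float,
                      mtbf_s: float) -> float:
    """First-order overhead per unit of useful work.

    The first term is time spent saving state; the second is the expected
    rework after a failure (on average half an interval is lost).
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    cost, mtbf = 2.0, 10_000.0  # hypothetical values, in seconds
    best = young_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        print(f"interval={interval:7.1f}s "
              f"overhead={overhead_fraction(interval, cost, mtbf):.4f}")
```

A "dynamic" variant in this spirit would recompute the interval whenever the failure-rate estimate or the measured checkpoint cost changes. Checkpointing too often wastes time saving state, while checkpointing too rarely increases the work lost on rollback.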
Keywords/Search Tags: Fault-tolerant cluster server, Fault recovery, Forward recovery, Transparency, Dynamic checkpointing strategy