Research Of Process Checkpoint Technology Based On Linux Kernel

Posted on: 2010-12-01
Degree: Master
Type: Thesis
Country: China
Candidate: L Jiao
Full Text: PDF
GTID: 2178360272480335
Subject: Computer application technology
Abstract/Summary:
As computer technology has developed, parallel and distributed computing has advanced considerably and is now widely used in scientific computing, information processing, and mobile communication, especially in high-performance computing. The rate of system faults rises correspondingly with the growth in system scale and in the complexity and running time of applications. To ensure high system reliability, fault-tolerance technology has therefore been widely adopted and has become a hot topic in computer research.

Software fault tolerance is increasingly applied in many areas because it is flexible and inexpensive to implement. Checkpoint and recovery is a popular software fault-tolerance mechanism. Its main idea is to save the state of processes periodically during fault-free operation, generating checkpoint files; after a fault occurs, the mechanism rolls the failed process back to the most recent checkpoint state and restarts the computation from there. Checkpoint and recovery can be implemented in two modes: as a user-level system or as a kernel-level system.

The advantages and disadvantages of the two modes are discussed by analyzing example systems of each. Based on this analysis, and using the Linux LKM (Loadable Kernel Module) mechanism, a method is proposed to design and implement HDCR, a process checkpoint and recovery system based on the Linux kernel. The checkpoint and recovery kernel module is implemented with a Linux kernel thread and loaded into the kernel to provide the underlying checkpoint and recovery mechanism. On top of this kernel module, a checkpoint library is built at user level to provide the corresponding interfaces; by using selected interfaces, users can implement a particular checkpoint and recovery algorithm effectively.
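The checkpoint-and-rollback idea described above (save state periodically during fault-free operation, then resume from the most recent checkpoint after a fault) can be illustrated with a small user-level sketch. This is not HDCR and does not use its kernel module or library interfaces, which the abstract does not specify; the function names, the `fail_at` fault-injection parameter, and the use of `pickle` for the checkpoint file are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # A crash before this point leaves the previous checkpoint untouched.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Read back the most recently saved state."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run(total_steps, interval, path, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` (hypothetical) injects one simulated fault for demonstration.
    If a checkpoint file exists, the computation resumes from it.
    """
    if os.path.exists(path):
        state = load_checkpoint(path)      # roll back to the last checkpoint
    else:
        state = {"step": 0, "acc": 0}      # fresh start
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run(10, 3, path, fail_at=7)` fails after checkpoints at steps 3 and 6 have been written; a second call `run(10, 3, path)` resumes from step 6 and returns 45, the same result as an uninterrupted run. A kernel-level system such as the one proposed here would instead capture the process state from inside the kernel, without the application having to describe its own state.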
Furthermore, an algorithm named parallel-sync is proposed; it works under parallel conditions to ensure that the checkpoints reach a globally consistent state.

The experimental data demonstrate that the system can checkpoint user processes transparently and efficiently, perform rollback recovery without failure, and support user-specified checkpoint algorithms. Together, these results show that the system is highly dependable, efficient, and flexible.
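The abstract does not describe how parallel-sync works, so the following is only a generic sketch of what a globally consistent checkpoint means, using a plain blocking (barrier-based) coordinated checkpoint rather than the thesis's algorithm. Threads stand in for parallel processes, and the function name `run_workers`, the account-transfer workload, and all parameters are assumptions for illustration. Workers move units between shared balances, so the total is invariant; a snapshot is consistent only if it preserves that total.

```python
import random
import threading


def run_workers(n_workers=4, rounds=20, interval=5, seed=0):
    """Run workers that transfer units; snapshot all of them every `interval` rounds.

    Returns {round_number: [balance of worker 0, worker 1, ...]}.
    """
    balances = [100] * n_workers
    lock = threading.Lock()
    barrier = threading.Barrier(n_workers)
    snapshots = {r: [None] * n_workers
                 for r in range(interval, rounds + 1, interval)}

    def worker(wid):
        rng = random.Random(seed + wid)
        for r in range(1, rounds + 1):
            peer = rng.randrange(n_workers)
            amount = rng.randint(1, 10)
            with lock:                      # one transfer = one atomic step
                balances[wid] -= amount
                balances[peer] += amount
            if r % interval == 0:
                barrier.wait()              # phase 1: all workers are quiescent
                snapshots[r][wid] = balances[wid]   # record local state
                barrier.wait()              # phase 2: all states recorded, resume

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return snapshots
```

Because no worker can start a new transfer between the two barriers, every snapshot sums to the initial total (400 with the defaults). Without the barriers, one worker could record its balance before a transfer and its peer after it, producing a set of local checkpoints that corresponds to no state the system was ever in.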
Keywords/Search Tags: fault tolerance, checkpoint and recovery, user level, system level, kernel module