
Research On Incremental Checkpointing And Rollback Recovery

Posted on: 2015-10-23
Degree: Master
Type: Thesis
Country: China
Candidate: P F Lu
GTID: 2348330518970249
Subject: Computer application technology
Abstract/Summary:
High-performance computing systems are increasingly built from many compute nodes connected by high-speed interconnects. As cluster systems grow in scale to deliver better performance, the number of potential failure points grows rapidly, so fault tolerance and self-healing are becoming extremely important.
Checkpointing is an effective fault-tolerance technique that keeps a process from having to restart from its initial state when a system failure occurs. Once a checkpoint has been taken, a process can recover from the most recent checkpointed state, which reduces the execution time lost to failures.
Using checkpointing to improve system reliability, however, introduces additional overhead. Incremental checkpointing is widely used to reduce this overhead in practical systems, especially in high-performance computing. A conventional (full) checkpoint saves the entire state of a process, whereas an incremental checkpoint saves only the dirty pages modified since the last checkpoint, which significantly reduces checkpoint size and overhead.
This thesis focuses on incremental checkpointing technology. After analyzing page-level and word-level incremental checkpointing, it adopts the page-level approach. Using page-saving and virtual memory area (VMA) saving techniques, an incremental checkpointing system was designed and implemented as a kernel module. The write bit of each page table entry is used to identify pages that have been modified (dirty pages). Changes to virtual memory areas are detected by modifying the system call table, and a dedicated data structure was designed to record these memory region changes.
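The thesis tracks dirty pages inside a Linux kernel module using the page table entry write bit; the abstract gives no code. As a rough illustration only, the following Python sketch simulates the idea in user space: taking a checkpoint clears every page's (simulated) write bit, the first later write to a protected page is treated as a fault that marks the page dirty and re-enables writing, and an incremental checkpoint saves only the dirty pages. The class `SimulatedAddressSpace`, its methods, and the 4096-byte page size are hypothetical and are not the thesis's kernel interfaces.

```python
PAGE_SIZE = 4096  # hypothetical page size used only by this simulation


class SimulatedAddressSpace:
    """Toy model of a process address space. Each page carries a 'writable'
    flag standing in for the write bit of its page table entry (PTE)."""

    def __init__(self, num_pages):
        self.pages = {n: bytearray(PAGE_SIZE) for n in range(num_pages)}
        self.writable = {n: True for n in range(num_pages)}
        self.dirty = set()
        self.faults = 0  # number of simulated write-protection faults

    def write(self, page_no, offset, data):
        if offset < 0 or offset + len(data) > PAGE_SIZE:
            raise ValueError("write crosses a page boundary")
        if not self.writable[page_no]:
            # Simulated write-protection fault: mark the page dirty and set
            # the write bit again so later writes to it do not fault.
            self.faults += 1
            self.dirty.add(page_no)
            self.writable[page_no] = True
        self.pages[page_no][offset:offset + len(data)] = data

    def checkpoint(self, full=False):
        """Return {page_no: bytes}. A full checkpoint saves every page; an
        incremental one saves only pages dirtied since the last checkpoint.
        The first checkpoint should be full, because writes made before any
        checkpoint are not tracked."""
        targets = self.pages if full else self.dirty
        image = {n: bytes(self.pages[n]) for n in sorted(targets)}
        # Clear every write bit so the next write to each page is detected.
        for n in self.writable:
            self.writable[n] = False
        self.dirty.clear()
        return image
```

Because only the first write to a page after a checkpoint faults, repeated writes to the same page cost nothing extra, and the incremental image contains each modified page exactly once regardless of how often it was written.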
To reduce the rollback recovery overhead of incremental checkpointing, a hybrid scheme combining full and incremental checkpoints was designed, and checkpoint files are read from back to front, starting with the most recent. The system supports detection of memory region changes, requires no modification to either kernel code or application code, and is transparent to the user. During recovery, each page is restored only once, and pages that have been deleted are not restored, which effectively reduces the rollback recovery overhead.
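The recovery rule described above (read checkpoint files newest first, restore each page only once, skip pages that have been deleted) can be sketched as a simplified Python model. It is not the thesis's implementation: `CheckpointFile`, `recover`, and the representation of memory regions as `[start_page, end_page)` ranges are hypothetical stand-ins for the thesis's checkpoint file format and memory-region change record, whose details the abstract does not give. The sketch also assumes that the hybrid scheme lets reading stop at the most recent full checkpoint.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CheckpointFile:
    full: bool                   # True for a full checkpoint, False for incremental
    pages: Dict[int, bytes]      # page number -> page contents saved in this file
    vmas: List[Tuple[int, int]]  # mapped regions as [start_page, end_page) ranges


def is_mapped(page_no: int, vmas: List[Tuple[int, int]]) -> bool:
    return any(start <= page_no < end for start, end in vmas)


def recover(checkpoints: List[CheckpointFile]) -> Dict[int, bytes]:
    """Rebuild page contents from checkpoint files ordered oldest -> newest.

    Files are read back to front (newest first). Reading stops at the most
    recent full checkpoint, each page is restored at most once (the newest
    copy wins), and pages outside the latest memory layout are skipped.
    Mapped pages missing from the result are assumed to be zero-filled by
    the caller (an assumption of this sketch).
    """
    if not checkpoints:
        raise ValueError("no checkpoint files")
    final_vmas = checkpoints[-1].vmas  # memory layout at the latest checkpoint
    restored: Dict[int, bytes] = {}
    for ckpt in reversed(checkpoints):
        for page_no, data in ckpt.pages.items():
            if page_no in restored:
                continue  # a newer copy was already restored
            if not is_mapped(page_no, final_vmas):
                continue  # page belongs to a region unmapped later; skip it
            restored[page_no] = data
        if ckpt.full:
            break  # older files are superseded by this full checkpoint
    return restored
```

Walking newest first is what makes "restore each page once" cheap: the first copy encountered is the newest, so older copies are skipped with a dictionary lookup instead of being written and then overwritten.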
Keywords/Search Tags: Fault Tolerance, Page-level Incremental Checkpoint, Rollback Recovery, Linux Kernel, Checkpointing Overhead