Research on a Fault-Tolerant Checkpointing Algorithm and Its Software Design

Posted on: 2013-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: C Q Shi
Full Text: PDF
GTID: 2248330374482646
Subject: Computer system architecture
Abstract/Summary:
In recent years, distributed systems have been adopted in a growing number of fields, including military, aviation, and financial systems. As distributed software grows more complex and the number of nodes increases, failures become more frequent and reliability declines. Without protective measures, such failures can cause significant loss of life and property, which makes research on fault-tolerant checkpointing valuable. This research is sponsored by the Natural Science Foundation of Shandong Province of China under Grant No. Z2008G03. This paper first describes the significance and current state of checkpointing technology, and then introduces the fault model of distributed systems and their fault-tolerant components. We propose a coordinated checkpointing algorithm for unreliable non-FIFO channels, in which the system can lose, duplicate, or reorder messages. A process may never compute on a message that is lost, may compute on a duplicated message more than once, and may compute on reordered messages out of their sending order. These problems cause processes to produce incorrect results and consequently prevent them from taking consistent global checkpoints. To resolve them, our algorithm assigns each message a sequence number. While a checkpoint is being established, its consistency can be determined from the sequence numbers of the messages sent and received. Checking those sequence numbers also identifies lost, reordered, and duplicate messages, which are handled by resending lost messages, buffering reordered messages, and dropping duplicates.
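The receiver-side handling described above (drop duplicates, buffer reordered messages, ask for lost ones again) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the `Receiver` class, its field names, and the convention that sequence numbers start at 0 per channel are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Receive state for one incoming channel (hypothetical structure)."""
    next_expected: int = 0                         # lowest sequence number not yet delivered
    buffered: dict = field(default_factory=dict)   # out-of-order messages, keyed by sequence number
    delivered: list = field(default_factory=list)  # payloads handed to the process, in sending order

    def on_message(self, seq, payload):
        """Handle one arriving message; return the sequence numbers still missing."""
        if seq < self.next_expected or seq in self.buffered:
            return []                              # duplicate: drop it
        self.buffered[seq] = payload               # hold until all earlier messages have arrived
        while self.next_expected in self.buffered:
            self.delivered.append(self.buffered.pop(self.next_expected))
            self.next_expected += 1
        if not self.buffered:
            return []
        # Gaps below the highest buffered number are messages that are late or lost.
        return [s for s in range(self.next_expected, max(self.buffered))
                if s not in self.buffered]
```

A gap alone cannot distinguish a lost message from a delayed one, so a real system would likely wait for a timeout before requesting retransmission of the returned sequence numbers; that policy is left out of the sketch.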
Our algorithm enables processes to take consistent global checkpoints. This paper also describes how Windows process checkpoints are set and recovered, a task divided into saving and restoring the user address space and the kernel objects. Finally, we simulate the setting and restoring of Windows process checkpoints in Visual Studio 2005.
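As a rough illustration of how the consistency of a global checkpoint could be determined from the sequence numbers of sent and received messages, the sketch below applies the standard no-orphan-message condition: no process may have recorded receiving a message that its sender has not recorded sending. The function names, the nested-dictionary layout, and the use of -1 for "nothing sent or received yet" are assumptions for this example and are not taken from the thesis.

```python
def is_consistent(sent, received):
    """Check a set of local checkpoints for orphan messages.

    sent[i][j]: highest sequence number process i had sent to j at i's checkpoint.
    received[j][i]: highest sequence number process j had delivered from i at j's checkpoint.
    Use -1 where nothing has been sent or delivered (assumed convention).
    """
    for i, row in sent.items():
        for j, last_sent in row.items():
            if received[j][i] > last_sent:
                return False  # j recorded a receive that i never recorded sending
    return True


def in_transit(sent, received, i, j):
    """Sequence numbers sent by i but not yet delivered at j when the checkpoints were taken.

    Assumes j delivers messages from i in sequence-number order.
    """
    return list(range(received[j][i] + 1, sent[i][j] + 1))
```

For two processes 0 and 1, a call might look like `is_consistent({0: {1: 4}, 1: {0: 2}}, {0: {1: 2}, 1: {0: 3}})`. The messages returned by `in_transit` are the ones that would need to be logged or resent so that they are not lost after recovery.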
Keywords/Search Tags: fault tolerance, unreliable non-FIFO channel, consistent global checkpoints, Windows checkpointing