Research Based on Message Number Checking and Message Rearranging Theory

Posted on: 2014-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: J Cai
Full Text: PDF
GTID: 2248330398961077
Subject: Computer system architecture
Abstract/Summary:
With the continued development of large distributed systems, their reliability has drawn increasing attention. Examples include China's Tianhe-1A, transportation systems, and FT-MPI, which is based on MPI. These systems matter not only to social and economic development but also to everyday life. Because fault tolerance underpins quality assurance, it has become both widely applied and increasingly important. Fault tolerance in a distributed system means tolerating errors and eliminating their influence. It is mainly divided into forward fault tolerance and rollback (backward) fault tolerance. Considering storage capacity and the recovery procedure, rollback fault tolerance is used more widely in practice than forward fault tolerance.
This research is sponsored by the Natural Science Foundation of Shandong Province under the project "the research and implementation of technology on the heterogeneous distributed system based on the rollback fault tolerance". There are two basic kinds of rollback fault tolerance: checkpoint-based and message-logging-based. The two main problems it must solve are how to preserve the state of the distributed system and how to restore a consistent global state after the system fails. A large body of literature addresses the determination of a consistent global state, but the existing methods have various defects. This thesis introduces a new method for determining a consistent global state in distributed systems, namely the message number detection method.
If every message recorded as received in a distributed system state is also recorded as sent, that state can be regarded as a consistent global state. The innovations of this thesis are as follows.
(1) Message number detection theory is put forward. Orphan messages and in-transit messages in a distributed global state can be identified by comparing the number of messages each process has received with the number of messages sent to it. If no orphan message exists, the distributed global state is consistent.
(2) Based on message number detection theory, a new algorithm is introduced for finding the maximal and minimal consistent global checkpoints that contain a given set of checkpoints. The algorithm first uses message number detection to determine whether orphan messages exist within the given set of checkpoints, which reduces time overhead. If orphan messages exist, no maximal or minimal consistent global checkpoint containing the given set exists in the distributed system. Otherwise, the algorithm uses a global search to find the maximal and minimal consistent global checkpoints that contain the given set.
(3) Message rearranging theory is proposed. The theory first introduces the concept of the always-happen-before (AHB) relationship between events and uses an improved per-process logical clock to represent AHB relationships. It then shows that the result of executing with a rearranged message order during recovery equals the result before the failure. Finally, it solves the problem of lost message order in the optimistic message logging protocol.
(4) A message logging protocol is proposed based on message number detection theory and message rearranging theory.
The advantages of this protocol are: ① It shows that reproducing lost messages in exactly the same order as before the failure is impossible. ② The message-receiving process can log and send messages asynchronously based on the message receiving protocol.
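The counting idea behind message number detection can be illustrated with a minimal sketch. It is not the thesis's algorithm, whose details the abstract does not give. It assumes reliable FIFO channels and that each checkpoint stores per-channel counters; the function names and the "sent"/"received" dictionary layout are hypothetical. For each ordered pair of processes, the number of messages the receiver has recorded as received is compared with the number the sender has recorded as sent: a surplus on the receiving side indicates orphan messages, and a surplus on the sending side indicates in-transit messages.

```python
from typing import Dict, List, Tuple

# Hypothetical checkpoint layout (an assumption, not taken from the thesis):
# {"sent": {dest_pid: count}, "received": {src_pid: count}}
Checkpoint = Dict[str, Dict[int, int]]
Channel = Tuple[int, int]  # (sender pid, receiver pid)


def check_message_numbers(
    checkpoints: List[Checkpoint],
) -> Tuple[Dict[Channel, int], Dict[Channel, int]]:
    """Compare per-channel send and receive counts across one global state.

    Assumes reliable FIFO channels, so that counts alone identify
    orphan and in-transit messages on each channel.
    """
    orphans: Dict[Channel, int] = {}
    in_transit: Dict[Channel, int] = {}
    n = len(checkpoints)
    for src in range(n):
        for dst in range(n):
            if src == dst:
                continue
            sent = checkpoints[src]["sent"].get(dst, 0)
            received = checkpoints[dst]["received"].get(src, 0)
            if received > sent:
                # Received but not recorded as sent: orphan messages.
                orphans[(src, dst)] = received - sent
            elif sent > received:
                # Sent but not yet recorded as received: in-transit messages.
                in_transit[(src, dst)] = sent - received
    return orphans, in_transit


def is_consistent(checkpoints: List[Checkpoint]) -> bool:
    """A global state is consistent when it contains no orphan message."""
    orphans, _ = check_message_numbers(checkpoints)
    return not orphans


if __name__ == "__main__":
    state = [
        {"sent": {1: 2}, "received": {2: 1}},  # P0
        {"sent": {2: 1}, "received": {0: 3}},  # P1
        {"sent": {0: 2}, "received": {1: 1}},  # P2
    ]
    print(check_message_numbers(state))
    print(is_consistent(state))
```

In this example state, P1 has recorded three messages from P0 while P0 has recorded only two sends, so the channel from P0 to P1 carries one orphan message and the state is inconsistent. The extra message from P2 to P0 is merely in transit and does not affect consistency.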
Keywords/Search Tags:distributed systems, fault-tolerance, checkpoints, message logging, rollback recovery