| Distributed systems are being applied in more and more fields because they carry low investment risk, scale well, can reuse existing software and hardware resources, and have a simple structure. These fields include large-scale scientific computing, telephone systems, airline booking systems, banking systems, stock trading systems, and shopping systems. As systems grow in scale, the probability of failure grows exponentially, and a system failure can have disastrous consequences. Distributed computing systems therefore urgently need fault-tolerance mechanisms. Checkpointing and rollback recovery is an important software fault-tolerance technique that is easy to implement and use, and it is well suited to distributed computing environments. In a distributed computing environment, uncertain communication bandwidth, limited storage space, and dynamic nodes that disconnect frequently mean that rollback recovery techniques developed for a single computer cannot be applied directly. The core problems of rollback recovery research are, while preserving system consistency, to reduce the storage cost of checkpoints and message logs, reduce the communication cost of the rollback recovery mechanism, improve node autonomy, reduce the coupling between nodes caused by inter-process dependencies, and make the rollback recovery mechanism transparent to nodes. This paper addresses these aspects. (1) In a distributed computing environment, many network structures are loosely coupled and nodes are highly autonomous, so the fault-tolerance mechanism should be a transparent service and the rollback recovery mechanism should be asynchronous. 
We present a non-blocking coordinated checkpointing and rollback recovery algorithm for distributed systems that differs from the conventional approach, in which processes first take temporary checkpoints and then convert them to permanent ones. The proposed checkpointing algorithm lets processes take permanent checkpoints directly, without taking temporary checkpoints, which contributes to its speed of execution. Orphan messages are eliminated by the sender processes, and in-transit messages are resolved by the checkpointing interval and a retransmission mechanism. Each node keeps only its most recent checkpoint and uses log information to avoid synchronization, which reduces the overhead of failure-free execution. After a node fails, it only needs to broadcast one synchronization message to the other processes, each of which then proceeds independently according to the algorithm once it receives the message. This achieves rollback recovery that is transparent to nodes and highly parallel. (2) For applications that include many nodes, an adaptive mechanism is needed because the frequency of information exchange between nodes differs, sometimes greatly. To suit these characteristics of distributed systems, we present a two-level checkpointing and rollback recovery fault-tolerance algorithm based on dynamic groups, which uses a coordinated checkpointing algorithm at the group level and a single-phase checkpointing algorithm at the system level. Nodes are grouped dynamically according to communication frequency, communication delay, bandwidth, the number of nodes, and other indicators. As a result, communication delay within a group is small and each group contains few nodes, so we adopt a coordinated checkpointing algorithm at the group level. 
Groups are usually connected to each other by networks with high delay and low bandwidth, and communication between groups is infrequent; the proposed system-level checkpointing algorithm takes full account of these characteristics. Orphan messages are eliminated by the sender groups, and in-transit messages are resolved by the checkpointing interval and a retransmission mechanism. The resulting system-level checkpoints therefore form a consistent global checkpoint, which avoids the domino effect. On the one hand, the algorithm adapts dynamically to the requirements of the application and improves the efficiency of the whole application; on the other hand, because orphan messages are eliminated by the sender processes, it moves from a two-phase commit algorithm to a single-phase commit algorithm, which contributes to its speed of execution. (3) In a distributed environment, how to construct a general message-passing mechanism and information platform that transmits various checkpoint messages in quasi-real time is worth studying. Based on the characteristics of distributed systems and checkpointing algorithms, we present a messaging mechanism that is extensible and can carry a variety of checkpoint messages. The mechanism is cross-platform, easy to extend, and transmits in quasi-real time. (4) Designing and implementing a prototype system on the basis of the theoretical research, to verify that the theory can be realized, is an important engineering step from theory to practical application. We studied the architecture of the prototype system, the requirements analysis of the client software, the software framework, the functional modules, and the processing flow, and implemented a prototype system based on the theoretical research, which showed that the theoretical results can be realized. |
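
The two message categories named in the abstract can be made concrete with a small sketch. It is not the algorithm proposed in this work; it only illustrates the standard definitions the abstract relies on. An orphan message is recorded as received in one checkpoint but not recorded as sent in the sender's checkpoint, and an in-transit message is recorded as sent but not yet recorded as received. A set of checkpoints is a consistent global checkpoint when it contains no orphan messages. The names `Checkpoint`, `routes`, `find_orphans`, and `find_in_transit` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical record of one process's state at checkpoint time."""
    sent: set = field(default_factory=set)      # ids of messages sent before the checkpoint
    received: set = field(default_factory=set)  # ids of messages received before the checkpoint


def find_orphans(checkpoints, routes):
    """Messages recorded as received whose send is NOT recorded by the sender.

    checkpoints: dict mapping process name -> Checkpoint
    routes:      dict mapping message id -> (sender name, receiver name)
    """
    orphans = set()
    for cp in checkpoints.values():
        for msg in cp.received:
            sender, _receiver = routes[msg]
            if msg not in checkpoints[sender].sent:
                orphans.add(msg)
    return orphans


def find_in_transit(checkpoints, routes):
    """Messages recorded as sent whose receipt is NOT recorded by the receiver."""
    in_transit = set()
    for cp in checkpoints.values():
        for msg in cp.sent:
            _sender, receiver = routes[msg]
            if msg not in checkpoints[receiver].received:
                in_transit.add(msg)
    return in_transit


def is_consistent(checkpoints, routes):
    """A global checkpoint is consistent when it records no orphan messages."""
    return not find_orphans(checkpoints, routes)


# Illustrative data: P2 checkpointed before sending m2, and before receiving m3.
routes = {"m1": ("P1", "P2"), "m2": ("P2", "P1"), "m3": ("P1", "P2")}
checkpoints = {
    "P1": Checkpoint(sent={"m1", "m3"}, received={"m2"}),
    "P2": Checkpoint(received={"m1"}),
}

print(find_orphans(checkpoints, routes))     # {'m2'}
print(find_in_transit(checkpoints, routes))  # {'m3'}
print(is_consistent(checkpoints, routes))    # False
```

In this example, rolling back to these checkpoints would leave `m2` received but never sent, which is the inconsistency the abstract says is removed on the sender side; `m3` is only in transit and can be recovered by retransmission.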