
Research Of Rollback Recovery Based On Dependency Tracking And Message Counting

Posted on: 2014-06-26  Degree: Master  Type: Thesis
Country: China  Candidate: G B Yuan
GTID: 2268330425484196  Subject: Software engineering
Abstract/Summary:
At present, a large number of scientific and engineering applications run on distributed computing systems. As systems grow in scale and in the number of nodes, the probability that a failure occurs during execution increases. To guarantee correct results despite failures or abnormal conditions, and to meet the needs of applications, the system must be fault-tolerant. Rollback recovery, which tolerates faults through time redundancy rather than node redundancy, is the mainstream technique for achieving reliability in high-performance distributed computing. However, rollback recovery incurs considerable overhead to protect system reliability, and this overhead greatly limits its application and development. Research on reducing the overhead of rollback recovery protocols and improving system execution efficiency is therefore of significant importance. The research in this thesis covers the following two aspects.
First, traditional message logging protocols incur large logging overhead because of their synchronization constraints. To address this problem, the thesis proposes a lightweight message logging protocol based on dependency tracking. The protocol exploits the characteristics of message passing at runtime and uses an information-piggybacking strategy to relax the synchronization constraints of message logging. Message data are stored at the sender, which imposes no constraint on message passing; the commit information of each message is carried along with subsequent messages and saved by the processes that depend on it, extending the dependency relation, and this way of storing it likewise introduces no constraint. Each saving party tracks which commit information has already been saved, which avoids unnecessary transmissions and reduces the amount of piggybacked data, giving the protocol its lightweight character.
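The logging scheme described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the names `Determinant`, `Message`, `Process`, `send_log`, `determinants` and `known` are hypothetical, and all state is kept in memory. Message payloads stay in the sender's volatile log; the receive order of each delivery is recorded as a determinant and piggybacked on later outgoing messages, so it is saved by the processes that come to depend on it; and each process tracks which determinants a peer is already known to hold, so nothing is piggybacked twice. No send ever waits for an acknowledgment.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical sketch of sender-based logging with piggybacked determinants.
# A determinant records how a receive happened:
# (sender id, sender sequence number, receiver id, receive sequence number).
Determinant = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Message:
    src: int
    dst: int
    ssn: int                           # sender sequence number
    payload: str
    piggyback: FrozenSet[Determinant]  # dependency information carried along


@dataclass
class Process:
    pid: int
    ssn: int = 0                       # messages sent so far
    rsn: int = 0                       # messages delivered so far
    # Sender-based log: message data stays in the sender's volatile memory.
    send_log: Dict[Tuple[int, int], str] = field(default_factory=dict)
    # Determinants this process stores (its own and those of processes it depends on).
    determinants: Set[Determinant] = field(default_factory=set)
    # Determinants each peer is already known to hold (avoids re-sending them).
    known: Dict[int, Set[Determinant]] = field(default_factory=dict)

    def send(self, dst: int, payload: str) -> Message:
        self.ssn += 1
        self.send_log[(dst, self.ssn)] = payload
        peer_known = self.known.setdefault(dst, set())
        # Piggyback only determinants the receiver is not known to hold.
        extra = frozenset(self.determinants - peer_known)
        peer_known |= extra
        # No blocking: the message leaves without waiting for any acknowledgment.
        return Message(self.pid, dst, self.ssn, payload, extra)

    def receive(self, msg: Message) -> str:
        self.rsn += 1
        # Record the delivery order of this message as a new determinant.
        self.determinants.add((msg.src, msg.ssn, self.pid, self.rsn))
        # Save the dependency information the message carried.
        self.determinants |= msg.piggyback
        # The sender evidently holds everything it piggybacked.
        self.known.setdefault(msg.src, set()).update(msg.piggyback)
        return msg.payload
```

For example, if process 0 sends to process 1 and process 1 then sends twice to process 2, the first message to process 2 carries the determinant of process 1's receive, while the second carries nothing, because process 1 already knows that process 2 holds it.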
Experimental results show that, compared with the Egida protocol, this protocol reduces both the logging overhead and the checkpointing overhead by about 10%.
Second, existing coordinated checkpointing protocols usually incur substantial blocking or coordination overhead. To address this problem, the thesis proposes a non-blocking coordinated checkpointing protocol based on message counting. The protocol classifies the runtime state of a process into three kinds. It exploits the fact that, while distributed parallel programs run, checkpoints are taken far more often than failures occur, and it combines information piggybacking with a non-blocking execution mechanism to shift part of the coordination overhead from the checkpointing phase to the failure recovery phase. In addition, the protocol identifies how each process communicated within a checkpoint interval, so that processes do not take unnecessary checkpoints, which reduces the overall overhead of the checkpointing phase. Experimental results show that the coordinated checkpointing overhead is reduced by 20-40% compared with the two-phase checkpointing protocol, and by about 20% compared with the distributed snapshot protocol.
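The message-counting idea can be sketched as follows, again only as an illustration under stated assumptions: FIFO channels, cumulative per-peer counters, and hypothetical names (`Checkpoint`, `Proc`, `take_checkpoint`, `check_line`). The sketch does not model the thesis's three process states or its piggybacked information. A checkpoint saves only the counters and never blocks the process; a process with no communication in the current interval skips its checkpoint; and the consistency work is deferred to recovery time, where comparing counters reveals orphan messages (received but not recorded as sent) and in-transit messages (sent but not yet received) that would have to be replayed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical sketch: checkpoints record message counters; consistency is checked at recovery.
# Assumes FIFO channels, so a count identifies exactly which messages crossed a channel.


@dataclass
class Checkpoint:
    index: int
    sent: Dict[int, int]      # cumulative messages sent to each peer
    received: Dict[int, int]  # cumulative messages received from each peer


@dataclass
class Proc:
    pid: int
    sent: Dict[int, int] = field(default_factory=dict)
    received: Dict[int, int] = field(default_factory=dict)
    communicated: bool = False  # any send/receive since the last checkpoint?
    checkpoints: List[Checkpoint] = field(default_factory=list)

    def on_send(self, dst: int) -> None:
        self.sent[dst] = self.sent.get(dst, 0) + 1
        self.communicated = True

    def on_receive(self, src: int) -> None:
        self.received[src] = self.received.get(src, 0) + 1
        self.communicated = True

    def take_checkpoint(self, index: int) -> bool:
        """Save the counters without blocking; return False if the checkpoint is skipped."""
        if self.checkpoints and not self.communicated:
            # Counters unchanged since the last checkpoint, so the previous one is reused.
            return False
        self.checkpoints.append(Checkpoint(index, dict(self.sent), dict(self.received)))
        self.communicated = False
        return True


def check_line(procs: List[Proc]) -> Tuple[List[Tuple[int, int]], Dict[Tuple[int, int], int]]:
    """Compare the counters of each process's latest checkpoint.

    Returns (orphan channels, in-transit message count per channel);
    channel (j, i) means messages from process j to process i.
    """
    line = {p.pid: p.checkpoints[-1] for p in procs}
    orphans: List[Tuple[int, int]] = []
    in_transit: Dict[Tuple[int, int], int] = {}
    for i, ci in line.items():
        for j, cj in line.items():
            if i == j:
                continue
            sent = cj.sent.get(i, 0)     # j -> i, as recorded in j's checkpoint
            got = ci.received.get(j, 0)  # j -> i, as recorded in i's checkpoint
            if got > sent:
                orphans.append((j, i))   # received but never recorded as sent: inconsistent
            elif sent > got:
                in_transit[(j, i)] = sent - got  # must be replayed from a log
    return orphans, in_transit
```

How orphans are resolved (for example, by rolling some processes back further) is deliberately left out of this sketch; it only shows where the counting work moves from checkpoint time to recovery time.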
Keywords/Search Tags: Rollback Recovery, Overhead, Coordinated Checkpoint, Message Logging, Dependency Tracking