A Checkpoint-Based Fault-Tolerant Service In Distributed Systems

Posted on: 2017-12-01  Degree: Master  Type: Thesis
Country: China  Candidate: Z Y Huang
GTID: 2348330491464014  Subject: Computer Science and Technology
Abstract/Summary:
Increasing failure frequency prolongs application completion time in distributed systems. Failure traces archiving failure data from many large-scale distributed systems have been released in a standard format. Together, these make it possible to find better solutions for fault-tolerant services. This thesis focuses on a checkpoint-based fault-tolerant model for distributed systems that takes failure correlation into account. Checkpoint placement strategies are also explored to reduce task execution time. The proposed service guarantees system reliability while lowering the implementation cost of fault-tolerant services and improving system efficiency. The main work and contributions include:

(1) Checkpoint/restart technology and its implementation methods are studied. How to restore the communication state and ensure a globally consistent state in a distributed system is also explored. A prototype system is designed to reveal the relationship among system scale, application size, and global checkpoint cost.

(2) A fault-tolerant model based on checkpointing and system failure event correlation is proposed. A method for clustering the nodes of a distributed system into correlation groups using the event analysis results is further discussed. A checkpoint placement strategy that includes two kinds of checkpoints is elaborated. The active checkpoint period is then derived from an optimization function that minimizes wasted time.

(3) A checkpoint-based fault-tolerant service is designed and implemented based on failure data from the Failure Trace Archive (FTA). Simulation analysis is conducted for comparison. First, the parameters of the models are evaluated. Then, the extra task execution time is studied using the rest of the failure data. Experimental results show that the proposed model is efficient in a real environment.
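The abstract does not state the optimization function used for the active checkpoint period in contribution (2). As a rough illustration of the general idea only, the sketch below uses Young's classical first-order approximation, which is a stand-in and not the thesis's model: it ignores failure correlation and the second kind of checkpoint. The function names, the checkpoint cost of 60 s, and the mean time between failures of one day are all assumptions chosen for the example.

```python
import math


def waste_fraction(interval, checkpoint_cost, mtbf):
    """First-order fraction of time wasted with periodic checkpoints.

    Assumed model (Young's approximation, not taken from the thesis):
    - checkpoint overhead per unit of work: checkpoint_cost / interval
    - expected rework after a failure: about interval / 2, with failures
      arriving at rate 1 / mtbf
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def young_interval(checkpoint_cost, mtbf):
    """Closed-form minimizer of waste_fraction: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def scan_interval(checkpoint_cost, mtbf, step=1.0):
    """Brute-force check: scan candidate intervals and keep the best one."""
    upper = 4.0 * young_interval(checkpoint_cost, mtbf)
    candidates = [step * k for k in range(1, int(upper / step) + 1)]
    return min(candidates, key=lambda t: waste_fraction(t, checkpoint_cost, mtbf))


if __name__ == "__main__":
    cost = 60.0      # hypothetical checkpoint cost in seconds
    mtbf = 86400.0   # hypothetical mean time between failures in seconds
    best = young_interval(cost, mtbf)
    print("closed-form interval (s):", round(best, 1))
    print("scanned interval (s):", scan_interval(cost, mtbf))
    print("waste fraction at optimum:", round(waste_fraction(best, cost, mtbf), 4))
```

The scan is included only to show that the closed form and a direct minimization of the waste function agree. A model in the spirit of the thesis would presumably replace the single MTBF with per-group failure statistics estimated from trace data.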
Keywords/Search Tags: Failure correlation analysis, Checkpoint/Restart fault tolerance, Coordinated Checkpointing, Failure Trace Archive failure data