
Checkpoint Optimization Methods For Parallel Microreboot

Posted on: 2018-07-10
Degree: Master
Type: Thesis
Country: China
Candidate: L Gu
Full Text: PDF
GTID: 2348330542987342
Subject: Computer Science and Technology
Abstract/Summary:
Parallel programs are widely used in scientific computing, finance, and national defense security, and they often run for months. To cope with random errors and deliberate attacks, fault-tolerance mechanisms are employed to ensure that parallel programs run correctly. Microreboot is an effective fault-tolerance method whose core idea is to recover the state of a local process while keeping the global results consistent. Restricted by current checkpoint frameworks, research on microreboot has focused mainly on serial programs, and support for parallel programs still has many limitations. This paper proposes a general parallel microreboot framework that handles hardware faults effectively, together with a multilevel checkpoint strategy and related optimization methods.

Firstly, a general parallel microreboot framework is designed on the basis of existing research on parallel fault tolerance and microreboot, in which checkpoint storage works well with the parallel environment. To address the limitations of hardware fault tolerance, the paper proposes a multilevel checkpoint strategy for parallel microreboot that combines the advantages of in-disk and in-memory checkpoints. Building on the in-disk checkpoint, the strategy uses dual memory checkpoints to solve the problem of checkpoint files being lost after a hardware failure. Furthermore, the paper implements a process migration mechanism for the in-memory checkpoint, which balances the load of the server cluster after a hardware failure. A dynamic period adjustment algorithm for the in-memory checkpoint is also designed, which reduces the cost in computation resources.

Secondly, a storage optimization method for the multilevel checkpoint strategy is proposed, which reduces the size of checkpoint files by investigating the memory layout of the multilevel checkpoint. Based on the characteristics of data distribution in scientific computing programs, the paper implements an incremental checkpoint with zero-block detection that uses a hash function to track changes in memory, which reduces the size of the checkpoint files. Considering the tradeoff between compression efficiency and system resource cost, a multilevel checkpoint compression algorithm is designed, which further reduces the disk and I/O cost of checkpoint files. The experimental results show that the proposed storage optimization method improves the efficiency and recovery rate of the proposed approach.

Finally, a near-optimal period calculation method for the multilevel checkpoint is designed. By representing the execution of a parallel program symbolically, the paper defines the cost of a checkpoint in terms of time complexity. On the basis of this definition, the checkpoint period optimization problem is modelled as a nonlinear checkpoint cost problem. The checkpoint cost formula is then obtained by analyzing the possible fault locations; two deceleration parameters and an acceleration parameter are introduced to reflect the impact of message logging on the multilevel checkpoint. The experimental results show that this method improves the efficiency of the proposed approach.
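The abstract does not give the details of the incremental checkpoint, so the following is only a minimal sketch of the general idea of hash-based change tracking with zero-block detection. The block size, the use of SHA-256, the fixed-size memory image, and the names `incremental_checkpoint` and `restore` are all assumptions made for illustration, not the thesis's implementation.

```python
import hashlib

BLOCK_SIZE = 4096     # assumed block size; the abstract does not specify one
ZERO_MARK = "zero"    # hypothetical marker recorded for all-zero blocks


def incremental_checkpoint(memory, prev_hashes, block_size=BLOCK_SIZE):
    """Return (delta, new_hashes) for one checkpoint of a fixed-size image.

    delta maps block index -> block bytes, or None for an all-zero block,
    and contains only the blocks that changed since the previous checkpoint.
    """
    delta = {}
    new_hashes = {}
    for idx, start in enumerate(range(0, len(memory), block_size)):
        block = memory[start:start + block_size]
        if block == bytes(len(block)):
            # Zero block: store a marker, never the payload.
            digest, payload = ZERO_MARK, None
        else:
            digest, payload = hashlib.sha256(block).hexdigest(), block
        new_hashes[idx] = digest
        if prev_hashes.get(idx) != digest:
            delta[idx] = payload  # block is new or has changed
    return delta, new_hashes


def restore(size, deltas, block_size=BLOCK_SIZE):
    """Rebuild the memory image by replaying deltas from oldest to newest."""
    image = bytearray(size)
    for delta in deltas:
        for idx, payload in delta.items():
            start = idx * block_size
            end = min(start + block_size, size)
            image[start:end] = payload if payload is not None else bytes(end - start)
    return bytes(image)


# Usage: two checkpoints of a three-block image where only the last block changes.
mem1 = bytes(4096) + b"a" * 4096 + b"b" * 100
delta1, hashes1 = incremental_checkpoint(mem1, {})
mem2 = bytes(4096) + b"a" * 4096 + b"c" * 100
delta2, hashes2 = incremental_checkpoint(mem2, hashes1)
restored = restore(len(mem2), [delta1, delta2])
```

In this sketch the second checkpoint stores only the one modified block, and the zero block never contributes payload bytes, which is the kind of size reduction the abstract attributes to the method.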
Keywords/Search Tags: Parallel microreboot, multilevel checkpoint, checkpoint file storage, checkpoint period