
Checkpoint Technology for the Grid

Posted on: 2007-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Zhang
GTID: 2208360182993776
Subject: Computer applications
Abstract/Summary:
The grid overcomes current computational barriers by integrating distributed resources to deliver high-performance computing power. By using idle computing resources, it coordinates scientific computing and problem solving on a large scale. As the grid rapidly expands, the probability of failure during a computation also increases. A fault on a single computing node can cause the whole job to fail, and the results computed so far may be lost. Moreover, because the grid is a collection of distributed resources, resources that are available now may no longer be available later. To improve the reliability of grid computing, the grid should therefore provide an adequate fault-tolerance service.

Checkpoint/restart is an effective technique for providing fault tolerance. Implementing a checkpoint mechanism in a grid environment can improve the quality of service for scientific computing. In this thesis, we conduct a theoretical and technical study of checkpoint technology and improve thread-based kernel-level checkpointing to speed up the checkpoint operation.

Taking the specific characteristics of the grid environment into account, this thesis also studies a checkpoint mechanism for the grid. When a node fails, the work can be restored from the previously saved state and continued without re-executing the entire job, which saves considerable time and provides fault tolerance. Taking full advantage of the grid data replica service, the checkpoint files are stored on remote nodes to ensure that they remain available. When a computing node fails, the task can migrate to another node and continue there. We also adopt parallel transfer to improve performance. Using this mechanism, we implement a checkpoint module in the MASSIVE (Multidisciplinary Applications-Oriented Simulation and Visualization Environment) environment. The module checkpoints tasks periodically, so that a job can be restarted from its most recent checkpoint and run to successful completion. On the basis of this module, MASSIVE can implement task migration and dynamic load balancing.
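
The periodic checkpoint and restart cycle described above can be illustrated with a minimal Python sketch. This is not the thesis's kernel-level mechanism or the MASSIVE module: it is an application-level illustration under stated assumptions. The names `save_checkpoint`, `load_checkpoint`, `run_job`, and the `fail_at` parameter are all hypothetical, and a local file stands in for the remote replica storage.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the most recently saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, interval, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical hook that simulates a node failure
    just before the given step is executed.
    """
    # Resume from the latest checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a run fails partway, calling `run_job` again with the same path resumes from the last saved step instead of step 0. In a grid setting the second call could happen on a different node, provided the checkpoint file is reachable from there.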
Keywords/Search Tags: grid computing, checkpoint, MASSIVE, fault tolerance, task migration, dynamic load balance