The Systematic Study Of Fault-tolerant Die

Posted on: 2007-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Lin
GTID: 2208360182493810
Subject: Computer application technology
Abstract/Summary:
With the development of cluster technology, clusters have become increasingly popular in scientific computing. In theory, the cluster model combines high availability with high performance and is both manageable and scalable. In practice, this remains an ideal, because cluster software still has a long way to go to achieve it. Once clusters were applied to high performance computing, high availability emerged as a problem. As hardware and software grow more complex, the more performance an application pursues, the stronger the guarantees it needs.

In this paper, we build a fault-tolerant deathless system for high performance computing, which targets scientific computing applications and provides users with a guarantee service and a knight service. The guarantee service ensures that the user submits a task to the system only once and that the task will be completed even under adverse conditions, such as the failure of any node in the grid, without any intervention from the user. The knight service means that the system provides intelligent modules for predicting system and task performance. The total running time of a task is shortened by adjusting the load of individual nodes in the grid and by migrating subtasks where this is effective.

We construct a task model for parallel computing problems, consisting of a user, a manager node, and worker nodes. To make task partitioning easier and management more effective, we handle tasks in a way similar to operating on folders and files, and invoke the program from the command line with parameters. The guarantee service is based on task-relive technology within an individual node and among multiple nodes. Task relive means that, after an unexpected failure, a task can be recovered from death to its original state and continue running. The knight service is based on performance prediction technology and task migration technology.
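The task-relive idea can be illustrated with a minimal user-level sketch. The code below is not the mechanism used in the thesis, which the abstract does not detail. It assumes a task whose state fits in a small dictionary, and the names `save_checkpoint`, `load_checkpoint`, `run_task`, and the `fail_at` parameter are hypothetical, introduced only for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    # os.replace swaps the new checkpoint in as a single step.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "partial_sum": 0}


def run_task(path, n_items, checkpoint_every=10, fail_at=None):
    """Sum the squares of 0..n_items-1, resuming from the last checkpoint.

    fail_at is a hypothetical hook that simulates a node failure.
    """
    state = load_checkpoint(path)
    for i in range(state["next_index"], n_items):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated node failure")
        state["partial_sum"] += i * i
        state["next_index"] = i + 1
        if state["next_index"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["partial_sum"]
```

Calling `run_task` a second time with the same checkpoint path after a failure resumes from the last saved index rather than from the beginning, so only the work done since the last checkpoint is repeated.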
Performance prediction finds the most suitable target node for migration according to the system load. Task migration allows a task to move from one node to another and continue running. Checkpoint/restart is the most important technology in the implementation of both the guarantee service and the knight service. There are various ways to implement checkpoint/restart: at user level or at system level, with or without modification of the source code, and so on. In this paper, we survey and analyze the current state of the art in checkpoint/restart mechanisms. Based on our experimental setting, we present our solution for the fault-tolerant deathless system. Finally, we analyze the system based on the experimental results and outline future work.
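As a rough illustration of load-based target selection for migration, the following sketch picks the live node with the earliest predicted finish time. The abstract does not describe the prediction model actually used, so the model here (queued work divided by a relative speed factor) is an assumption, and `Node`, `pick_target`, and `migrate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    speed: float = 1.0        # assumed relative processing speed
    queued_work: float = 0.0  # assumed outstanding work units
    alive: bool = True

    def predicted_finish(self, extra_work=0.0):
        """Predicted time to drain the queue plus any extra work."""
        return (self.queued_work + extra_work) / self.speed


def pick_target(nodes, work, exclude=None):
    """Choose the live node that would finish soonest after taking `work`."""
    candidates = [n for n in nodes if n.alive and n is not exclude]
    if not candidates:
        raise RuntimeError("no live node available for migration")
    return min(candidates, key=lambda n: n.predicted_finish(work))


def migrate(source, target, work):
    """Move `work` units of queued work from source to target."""
    source.queued_work -= work
    target.queued_work += work
```

In a real system the moved subtask would be restarted on the target from its latest checkpoint. Here only the bookkeeping of load is modeled.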
Keywords/Search Tags:task relive, task migration, checkpoint/restart, cluster, parallel, high performance computing, grid computing, guarantee service, knight service, process image