Research And Implementation Of The Automatic Jobs Fault Tolerant Technology Based On Checkpoint

Posted on:2009-02-04

Degree:Master

Type:Thesis

Country:China

Candidate:X P Lu

Full Text:PDF

GTID:2178360278456779

Subject:Computer Science and Technology

Abstract/Summary:

High-performance computing (HPC) systems, which reflect a country's overall level of science and technology, are now used in many fields, including the military, the economy, and scientific research. As both system architectures and applications grow in scale, HPC systems become larger and more complex. Consequently, the failure rate rises exponentially and job run times become markedly longer. Studying fault-tolerance technology to improve the availability of HPC systems is therefore important.

To address the lack of fault-tolerance support in HPC systems, this thesis studies the key fault-tolerance technologies in resource management systems, and designs and implements automatic job fault tolerance based on the resource management system. The main work and contributions are as follows.

1. Fault tolerance is an important way to improve the availability of HPC systems. To address the shortcomings of the fault-tolerance functions in current HPC systems, this thesis proposes an automatic job fault-tolerance framework based on the resource management system. Within this framework, automatic job fault tolerance is implemented, improving the availability and efficiency of HPC systems.

2. To address the same shortcomings, this thesis studies current fault-detection technology and proposes a fault-detection model based on node components in HPC systems. The characteristics of this model are evaluated by comparison with related technology in current HPC systems.

3. To address the lack of automatic job checkpointing in current HPC systems, this thesis studies current checkpointing techniques for parallel applications, and designs and implements an automatic CHECKPOINT/RESTART mechanism based on the resource management system. The automatic CHECKPOINT/RESTART function avoids much of the resource waste caused by repeated computation. It also lowers the technical skill users need to perform system management.

4. Using the NAS Parallel Benchmarks, this thesis evaluates the system in terms of both function and performance. The results indicate that automatic fault detection and job CHECKPOINT/RESTART are implemented and that the overhead is low. We conclude that the design achieves automatic fault tolerance with low additional overhead and remarkably improves the availability of HPC systems.
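The CHECKPOINT/RESTART idea in item 3 can be illustrated with a toy sketch. This is not the thesis's implementation and does not use any real resource manager interface. The names `run_job` and `run_with_auto_restart`, the JSON checkpoint file, and the simulated fault are all assumptions made for illustration. The sketch only shows why a job that restarts from its last checkpoint repeats less work than one that restarts from the beginning.

```python
import json
import os
import tempfile


def run_job(total_steps, ckpt_path, ckpt_every, fail_at=None):
    """Toy job: sum the integers 0..total_steps-1, one per step.

    State is saved every `ckpt_every` steps. If `fail_at` is set, a
    RuntimeError is raised on reaching that step to simulate a node fault.
    Returns (result, steps_executed_in_this_run).
    """
    # RESTART: resume from the last saved state if a checkpoint exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "acc": 0}

    executed = 0
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node fault")
        state["acc"] += state["step"]
        state["step"] += 1
        executed += 1
        if state["step"] % ckpt_every == 0:
            # CHECKPOINT: write to a temporary file, then rename, so a
            # crash during the write cannot corrupt the previous checkpoint.
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, ckpt_path)
    return state["acc"], executed


def run_with_auto_restart(total_steps, ckpt_path, ckpt_every, fail_at):
    """Hypothetical stand-in for the resource manager: rerun after a fault."""
    try:
        return run_job(total_steps, ckpt_path, ckpt_every, fail_at)
    except RuntimeError:
        # The user does nothing; the job is rerun and resumes by itself.
        return run_job(total_steps, ckpt_path, ckpt_every, fail_at=None)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    result, executed = run_with_auto_restart(100, path, 10, fail_at=57)
    print(result, executed)  # prints "4950 50"
```

In this run the fault occurs at step 57 and the last checkpoint was taken at step 50, so the rerun executes only the remaining 50 steps instead of all 100. Only the 7 steps between the checkpoint and the fault are repeated.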

Keywords/Search Tags:

HPC, Fault Tolerance, ROC, SLURM, CHECKPOINT/RESTART

Related items

1	Research On Adaption Method Of Cloud Fault Tolerance Services Based On User Requirement And Resource Constriction
2	The Design And Research Of Process Level Fault-tolerance Based On Checkpoint
3	The Research And Implementation Of Checkpoint Technology Based On WinNT
4	Research And Implementation Of PVM-based Cluster Fault-tolerance Method
5	Fault Tolerance Strategy Of Mobile Cloud Based On Hierarchical Checkpoint
6	Research On Checkpoint Subsystem For Linux SSI Cluster
7	Research On Implementation Technologies Of Checkpoint System And Optimization Of Performance
8	For Grid Checkpoint Technology
9	Research Of Task Recovery Strategy Based On Checkpoint In MapReduce
10	Research On Checkpoint Technique Based On Cluster State