
Research And Implementation Of The Automatic Jobs Fault Tolerant Technology Based On Checkpoint

Posted on: 2009-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: X P Lu
Full Text: PDF
GTID: 2178360278456779
Subject: Computer Science and Technology
Abstract/Summary:
High-performance computing (HPC) systems, which reflect a country's overall level of science and technology, are now used in many fields, including the military, the economy, and scientific research. As both system architectures and application scales grow, HPC systems are becoming larger and more complex. Consequently, the fault rate rises exponentially and job run times become markedly longer. Studying fault-tolerance technology for HPC systems is therefore important for improving their availability.

To address the lack of fault-tolerance support in HPC systems, this thesis studies the key fault-tolerance technologies in the resource management system, and designs and implements automatic job fault tolerance built on the resource management system. The main work and contributions are as follows.

1. Fault tolerance is an important way to improve the availability of HPC systems. To address the functional gaps in the fault-tolerance support of current HPC systems, this thesis proposes an automatic job fault-tolerance framework based on the resource management system. Under this framework, automatic job fault tolerance is implemented, and the availability and efficiency of HPC systems are improved.

2. To address the functional gaps in current HPC systems, this thesis studies existing fault-detection techniques and proposes a fault-detection model for HPC systems based on node components. The characteristics of the model are evaluated in comparison with related techniques in current HPC systems.

3. To address the lack of automatic job checkpointing in current HPC systems, this thesis studies existing checkpointing techniques for parallel applications, and designs and implements an automatic CHECKPOINT/RESTART mechanism based on the resource management system. With automatic CHECKPOINT/RESTART in place, much of the resource waste caused by repeated computation is avoided. It also lowers the technical skill users need to manage the system.

4. Using the NAS Parallel Benchmarks, this thesis evaluates the system in terms of both function and performance. The results indicate that automatic fault detection and job CHECKPOINT/RESTART both work as designed and that the overhead is low. We conclude that the design implements automatic fault tolerance with low additional overhead and remarkably improves the availability of HPC systems.
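The idea behind item 3, that restarting from a saved checkpoint avoids repeating completed computation, can be illustrated with a minimal sketch. This is not the thesis's mechanism and uses no SLURM or checkpoint-library API. The function names (`save_checkpoint`, `run_job`, `run_with_restart`), the JSON state file, the checkpoint interval, and the simulated fault are all assumptions made for illustration.

```python
import json
import os


def save_checkpoint(state, path):
    """Write the job state to a temp file, then atomically swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # a crash mid-write never corrupts the old checkpoint


def run_job(total_steps, ckpt_path, ckpt_every=10, fail_at=None):
    """Toy job: sum the integers 0..total_steps-1, checkpointing periodically.

    If a checkpoint file exists, the job resumes from it instead of step 0.
    `fail_at` (hypothetical) simulates a node fault before that step runs.
    Returns (result, number_of_steps_executed_in_this_run).
    """
    state = {"step": 0, "acc": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)

    executed = 0
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node fault")
        state["acc"] += state["step"]
        state["step"] += 1
        executed += 1
        if state["step"] % ckpt_every == 0:
            save_checkpoint(state, ckpt_path)
    return state["acc"], executed


def run_with_restart(total_steps, ckpt_path, fail_at):
    """Stand-in for an automatic restart: rerun the job once after a fault."""
    try:
        return run_job(total_steps, ckpt_path, fail_at=fail_at)
    except RuntimeError:
        # The second run picks up from the last checkpoint on disk.
        return run_job(total_steps, ckpt_path)
```

In this toy setup, a 100-step job that faults before step 57 resumes from the checkpoint taken after step 50, so the restarted run executes only the remaining 50 steps rather than all 100. In a real HPC system the state is a parallel application's process image, and detection and restart are driven by the resource manager, not an in-process `try`/`except`.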
Keywords/Search Tags: HPC, Fault Tolerance, ROC, SLURM, CHECKPOINT/RESTART