Research Of Fault Containment And Checkpointing Technology For Shared-Memory Multiprocessor

Posted on: 2007-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: N Yuan
Full Text: PDF
GTID: 2178360215970418
Subject: Computer Science and Technology
Abstract/Summary:
With the development of HPC, computer systems are growing ever larger and their architectures more complex. This greatly increases both the likelihood of hardware failure and the difficulty of handling failures. High reliability and high availability, however, are basic requirements of a supercomputer, so an effective fault-tolerance method for supercomputer systems is important. This dissertation focuses on shared-memory supercomputers, which suffer from an inherent fragility: a single hardware or system software failure can crash the entire machine, and such failures are hard for designers to handle.

This dissertation introduces the theory of fault tolerance and fault containment in computers, and studies the architecture of shared-memory computers, the runtime features of the RAS system, and the behavioral characteristics and classification of failures. Building on related work on node-level fault containment, which handles faults in units of nodes, it investigates a method that partitions fault-containment regions on a per-process basis. To address the weaknesses of fault containment, it combines fault containment with checkpointing: fault containment confines the area over which a failure can spread, and checkpointing recovers the failed processes. To improve the efficiency of checkpointing, a dynamic checkpointing technique is proposed and tested with RSIM. The results show that dynamic checkpointing performs as well as the traditional approach with less overhead, which should improve the availability of the system. My research makes the following contributions:...
Keywords/Search Tags: shared memory, fault tolerance, fault containment, RAS system, checkpoint, RSIM