Research Of Hardware Fault-containment In Intra-node Of CC-NUMA Multiprocessor

Posted on: 2006-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: D F Liu
Full Text: PDF
GTID: 2178360185463804
Subject: Computer Science and Technology
Abstract/Summary:
As high-performance computing develops, the demands on the reliability and availability of supercomputers keep increasing. New machines are growing in scale and their architectures are becoming more complex, which raises the likelihood of hardware faults and makes those faults harder to handle. CC-NUMA shared-memory supercomputers in particular suffer from an inherent fragility: a single hardware or system software failure can crash the entire machine, and such failures are difficult for hardware and software designers to resolve.

This dissertation introduces the theory of computer fault tolerance and fault containment in detail, and studies the architecture of CC-NUMA computers and the behavior of cache coherence in the presence of faults. Building on related work on node-level fault containment, which handles faults with the node as the unit, it proposes the concept of module-level fault containment, which handles faults with a module inside a node as the unit. It then proposes a method that partitions fault-containment regions on the basis of processes. I design and implement the fault-containment method and a recovery algorithm, which effectively address hardware faults in CC-NUMA computers. The method offers better fault handling and greater flexibility.

My research makes the following primary contributions: (1) I analyze the characteristics of faults in CC-NUMA systems and construct a fault model; (2) I study the characteristics of a supercomputer's RAS (reliability, availability, and serviceability) system and how to use it to handle faults in CC-NUMA supercomputers; (3) based on the characteristics of the CC-NUMA architecture, I design the fault-containment region; (4) I analyze the effect that each kind of fault has on the system; (5) I propose a recovery algorithm that returns the system to an operational state; (6) I validate the correctness of the module and analyze the scalability of the system.
Keywords/Search Tags: CC-NUMA, supercomputer, fault-containment, RAS system
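
The idea of partitioning fault-containment regions by process, as described in the abstract, can be illustrated with a minimal sketch. The abstract does not give the actual partitioning or recovery algorithm, so everything below is an assumption made for illustration: the `Process` record, the module names such as `node0/mem0`, and the rule that a process belongs to the containment region of a failed module exactly when it depends on that module.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical record: a process and the intra-node modules it depends on."""
    pid: int
    modules: set[str] = field(default_factory=set)


def containment_region(processes, failed_module):
    """Return the pids of processes that depend on the failed module.

    Assumption: the fault-containment region of a module fault is the
    set of processes using that module, not the whole node.
    """
    return {p.pid for p in processes if failed_module in p.modules}


def recover(processes, failed_module):
    """Split processes into (terminated, surviving) after a module fault.

    Assumption: recovery terminates only the processes inside the
    containment region and lets all other processes keep running.
    """
    region = containment_region(processes, failed_module)
    terminated = sorted(region)
    surviving = sorted(p.pid for p in processes if p.pid not in region)
    return terminated, surviving


# Three processes on one node, sharing some CPU and memory modules.
procs = [
    Process(1, {"node0/cpu0", "node0/mem0"}),
    Process(2, {"node0/cpu1", "node0/mem0"}),
    Process(3, {"node0/cpu1", "node0/mem1"}),
]

terminated, surviving = recover(procs, "node0/mem0")
print(terminated, surviving)  # prints: [1, 2] [3]
```

In this sketch a fault in one memory module stops only the two processes that use it, while the third process on the same node continues. Under node-level containment, by contrast, all three would be treated as lost because the node is the unit of failure.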