
High performance computing spare replacement hardware fault tolerance

Posted on: 2005-05-23
Degree: Ph.D.
Type: Dissertation
University: The University of New Mexico
Candidate: Dreicer, Jared Samuel
Full Text: PDF
GTID: 1458390008998166
Subject: Computer Science
Abstract/Summary:
The use of spare replacement hardware and checkpoint-rollback software fault tolerance on multiple-instruction-multiple-data (MIMD) architectures was investigated. New performance results are presented for spare node replacement after a simulated failure and for migration onto a spare node prior to a simulated failure. Spare replacement and migration onto a spare were implemented for application parameter characterization runs on 32 nodes and for scaling runs from 8 to 128 nodes on a MIMD cluster. The CUMULVS system was used for fault tolerance and control features. We evaluated the spare node replacement and migration onto spare node approaches using runtime to quantify performance and demonstrate the viability of the approaches.
The principal new results of this study are that: (1) spare node replacement provides good performance at a small cost in runtime; (2) migration onto a spare provides even better performance at a small cost in runtime; and (3) a runtime breakeven point dependent on system scale is identified for both approaches relative to traditional approaches.
Results were quantified for empirical studies on 8 to 128 nodes. These studies investigated applications characterized by various computation-communication ratios, work patterns (steady, accumulate, disperse, hill, and hole), and topologies (ring, one-to-all, and near neighbor). The decrease in the cost of commodity hardware enables strategies that can efficiently use a spare as a general means of dynamic redundancy. The gain from these approaches is a reduced mean time to repair (MTTR), because immediate access to a spare shortens recovery time. Checkpoint and rollback overhead is still incurred, but for migration onto a spare, checkpoint overhead can be dramatically reduced.
The scale of distributed memory MIMD architectures continues to grow as a result of user requests for greater performance, increased computational requirements for finer resolution, and the decreasing cost of commodity hardware. However, these larger architectures experience an increasing frequency of component failures and a subsequent loss of availability. Fault tolerance and availability are therefore important issues for high performance computing systems executing long-running applications. Our research indicates that utilizing spare replacement enhances the scalability and availability of MIMD architectures and that further research will pay important dividends.
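The link the abstract draws between system scale, MTTR, and availability can be illustrated with the standard steady-state availability relation, availability = MTBF / (MTBF + MTTR). The sketch below is not taken from the dissertation: the function names, the assumption of independent exponentially distributed node failures, and all numeric values are hypothetical and chosen only for illustration.

```python
def system_mtbf(node_mtbf_hours: float, nodes: int) -> float:
    """MTBF of a system that fails when any one node fails.

    Assumes independent, exponentially distributed node failures
    (an illustrative assumption, not a claim from the dissertation).
    """
    if node_mtbf_hours <= 0 or nodes <= 0:
        raise ValueError("node_mtbf_hours and nodes must be positive")
    return node_mtbf_hours / nodes


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive, mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical values: 10,000 h per-node MTBF, 4 h repair without a
    # spare, 0.25 h recovery when a spare is immediately available.
    for n in (8, 32, 128):
        mtbf = system_mtbf(10_000.0, n)
        print(n, availability(mtbf, 4.0), availability(mtbf, 0.25))
```

Under these assumptions, system MTBF shrinks in proportion to node count, so the same MTTR costs more availability at larger scale, and shortening MTTR recovers part of that loss.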
Keywords/Search Tags: Spare, Performance, MIMD, Fault, Hardware