High performance computing spare replacement hardware fault tolerance

Posted on: 2005-05-23

Degree: Ph.D.

Type: Dissertation

University: The University of New Mexico

Candidate: Dreicer, Jared Samuel

GTID: 1458390008998166

Subject: Computer Science

Abstract/Summary:

The use of spare replacement hardware and checkpoint-rollback software fault tolerance on multiple-instruction-multiple-data (MIMD) architectures was investigated. New performance results are presented for spare node replacement after a simulated failure and for migration onto a spare node prior to a simulated failure. Both approaches were implemented for application parameter characterization runs on 32 nodes and for scaling runs from 8 to 128 nodes on a MIMD cluster. The CUMULVS system was used for fault tolerance and control features. We evaluated the spare node replacement and migration onto spare node approaches using runtime to quantify performance and demonstrate the viability of the approaches.
The principal new results of this study are that: (1) spare node replacement provides good performance at a small cost in runtime; (2) migration onto a spare provides even better performance at a small cost in runtime; and (3) a runtime breakeven point dependent on system scale is identified for both approaches relative to traditional approaches.
Results were quantified for empirical studies on 8 to 128 nodes. These studies investigated applications characterized by various computation-communication ratios, work patterns (steady, accumulate, disperse, hill, and hole), and topologies (ring, one-to-all, and near neighbor). The decrease in the cost of commodity hardware enables strategies that can efficiently use a spare as a general means of dynamic redundancy. The gain from these approaches is that, because recovery time decreases given immediate access to a spare, the mean time to repair (MTTR) is reduced. Checkpoint and rollback overhead is still incurred, but for migration onto a spare, checkpoint overhead can be dramatically reduced.
The scale of distributed-memory MIMD architectures continues to grow as a result of user requests for greater performance, increased computational requirements for finer resolution, and the decreasing cost of commodity hardware. However, these larger architectures experience an increasing frequency of component failures and a subsequent loss of availability. Fault tolerance and availability are therefore important issues for high performance computing systems executing long-running applications. Our research indicates that utilizing spare replacement enhances the scalability and availability of MIMD architectures and that further research will pay important dividends.
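
The link between scale, MTTR, and availability described above can be illustrated with a minimal sketch. It is not taken from the dissertation: it assumes the textbook steady-state availability formula A = MTTF / (MTTF + MTTR) and independent, exponentially distributed node failures, so that the system mean time to failure is the per-node value divided by the node count. The function names, the per-node MTTF, and the two repair times are hypothetical values chosen only for illustration.

```python
def system_mttf(node_mttf_hours: float, nodes: int) -> float:
    """Mean time to first failure of a system of identical nodes.

    Assumes independent, exponentially distributed node failures
    (an illustrative assumption, not a result from the dissertation).
    """
    return node_mttf_hours / nodes


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Textbook steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    NODE_MTTF_HOURS = 50_000.0   # hypothetical per-node MTTF
    MTTR_MANUAL_HOURS = 4.0      # hypothetical repair time without a spare
    MTTR_SPARE_HOURS = 0.1       # hypothetical repair time with a ready spare

    for nodes in (8, 32, 128):
        mttf = system_mttf(NODE_MTTF_HOURS, nodes)
        a_manual = availability(mttf, MTTR_MANUAL_HOURS)
        a_spare = availability(mttf, MTTR_SPARE_HOURS)
        print(f"{nodes:4d} nodes: system MTTF {mttf:8.1f} h, "
              f"availability {a_manual:.6f} (manual) vs {a_spare:.6f} (spare)")
```

Under these assumptions the system MTTF shrinks in proportion to the node count, so a shorter MTTR matters more as the cluster grows, which is consistent with the abstract's qualitative argument.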

Keywords/Search Tags:

Spare, Performance, MIMD, Fault, Hardware

Related items

1	Spare Parts Catalog & Order-making System For Overseas Clients Of Commercial Vehicle Company
2	Architecture and performance of processor-memory interconnection networks for MIMD shared memory parallel processing systems
3	Analysis Of Hardware Fault Propagation In Programs And Research On Fault-tolerance Techniques
4	The Study Of Performance Testing Technique Based On Hardware Performance Monitoring
5	Infrared Moving Target Identification And Tracking System (dsp + Fpga) Hardware Design And Realization
6	The Design And Implementation Of Spare Parts Management System For Tongda Industrial Company
7	Research Fault Tolerance Methods Based On Evolvable Hardware
8	Software implemented hardware fault tolerance
9	Research Of On-chip Intelligent Self-recovery Technology Based On Evolutionary Hardware
10	Research On Fault Diagnosis Of Computer Hardware System Based On Bayesian Network