Font Size: a A A

Power-aware resilience for exascale computing

Posted on:2015-11-27Degree:Ph.DType:Thesis
University:University of PittsburghCandidate:Mills, BryanFull Text:PDF
GTID:2478390020952778Subject:Computer Science
Abstract/Summary:
To enable future scientific breakthroughs and discoveries, the next generation of scientific applications will require exascale computing performance to support the execution of predictive models and analysis of massive quantities of data, with significantly higher resolution and fidelity than what is possible within existing computing infrastructure. Delivering exascale performance will require massive parallelism, which could result in a computing system with over a million sockets, each supporting many cores. Resulting in a system with millions of components, including memory modules, communication networks, and storage devices. This increase in number of components significantly increases the propensity of exascale computing systems to faults, while driving power consumption and operating costs to unforeseen heights. To achieve exascale performance two challenges must be addressed: resilience to failures and adherence to power budget constraints. These two objectives conflict insofar as performance is concerned, as achieving high performance may push system components past their thermal limit and increase the likelihood of failure. With current systems, the dominant resilience technique is checkpoint/restart. It is believed, however, that this technique alone will not scale to the level necessary to support future systems. Therefore, alternative methods have been suggested to augment checkpoint/restart -- for example process replication.;In this thesis, we present a new fault tolerance model called shadow replication that addresses resilience and power simultaneously. Shadow replication associates a shadow process with each main process, similar to traditional replication, however, the shadow process executes at a reduced speed. Shadow replication reduces energy consumption and produces solutions faster than checkpoint/restart and other replication methods in limited power environments. Shadow replication reduces energy consumption up to 25% depending upon the application type, system parameters, and failure rates. The major contribution of this thesis is the development of shadow replication, a power-aware fault tolerant computational model. The second contribution is an execution model applying shadow replication to future high performance exascale-class systems. Next, is a framework to analyze and simulate the power and energy consumption of fault tolerance methods in high performance computing systems. Lastly, to prove the viability of shadow replication an implementation is presented for the Message Passing Interface.
Keywords/Search Tags:Computing, Shadow replication, Exascale, Performance, Power, Resilience, Systems
Related items