Application-Aware On-Line Failure Recovery For Extreme-Scale HPC Environment

Posted on: 2018-06-01
Degree: Ph.D
Type: Dissertation
University: Rutgers The State University of New Jersey - New Brunswick
Candidate: Balmana, Marc Gamell
Full Text: PDF
GTID: 1448390002450920
Subject: Electrical engineering
Abstract/Summary:
High Performance Computing (HPC) brings with it the promise of deeper insight into complex phenomena through the execution of various extreme-scale applications, especially those in the fields of science and engineering. The increasing computational demands of these applications continue to push the limits of current extreme-scale HPC systems. As a result, the community is working toward exascale systems able to compute 10^18 floating point operations per second (FLOPS). Since these systems are expected to contain a large number of components, reliability is one of the key anticipated challenges. Due to the extensive periods of time that complex applications require, future systems will likely see an increase in process and node failures during application execution. These failures, also known as hard failures, are currently handled by terminating the execution and restarting it from the last stored checkpoint. This checkpoint-restart methodology requires the application to periodically save its distributed state into centralized, stable storage, an approach that is not expected to scale to future extreme-scale systems. While the illusion of a failure-free machine, implemented via either hardware or system software strategies, is adequate for current HPC systems, it may prove too costly in future extreme-scale machines. Resilience is, therefore, a key challenge that must be addressed in order to realize the exascale vision.
This dissertation explores new models that leverage application-awareness to enable on-line failure recovery. On-line recovery, which does not require the interruption of surviving processes in order to collectively restart the entire application, offers better cost/performance tradeoffs by reducing recovery overheads. Recovering processes on-line enables application-specific data recovery strategies and optimized in-memory checkpointing while avoiding the repetition of initialization procedures (the least optimized part of most production-level applications) on all processes.
This dissertation addresses three areas of research in on-line failure recovery. First, it explores a generic global on-line recovery model, involving all processes in the recovery process. Second, it explores optimized local recovery, in which communication characteristics of certain application classes are leveraged to reduce overheads due to failure. In particular, finite difference partial differential equation solvers using stencil operators are used as the driving application class. Third, this dissertation demonstrates how the overhead of multiple, independent failures can be masked to effectively reduce the impact on total execution time. The models presented in this dissertation are implemented and evaluated in Fenix and FenixLR, a pair of generic and extensible frameworks used to demonstrate the concepts.
Keywords/Search Tags: HPC, On-line failure recovery, Application, Extreme-scale, Execution