
Runtime systems for load balancing and fault tolerance on distributed systems

Posted on: 2015-12-18
Degree: Ph.D
Type: Dissertation
University: The Ohio State University
Candidate: Arafat, Md Humayun
Full Text: PDF
GTID: 1478390017993712
Subject: Computer Science
Abstract/Summary:
Exascale computing creates many challenges for scientific applications in both hardware and software. There is a continuous need for adaptation to new architectures. Load balancing and data distribution are major issues on increasingly large machines. In addition, fault tolerance has to be considered in every aspect of the system. In this dissertation, we make contributions to parallel computing, load balancing, and fault tolerance in the context of scientific applications.

The dynamical nucleation theory Monte Carlo (DNTMC) application from the NWChem computational chemistry suite uses a two-level parallel Markov chain Monte Carlo structure, with periodic synchronization points that assemble the results of independent finer-grained calculations. Like many such applications, the existing code employs a static partitioning of processes into groups and assigns each group a piece of the finer-grained parallel calculation. A significant cause of performance degradation is load imbalance among groups, since the time requirements of the inner parallel calculation vary widely with the input problem and over the course of the Monte Carlo simulation. We present a novel approach to load balancing such calculations with minimal changes to the application. We introduce the concept of a resource sharing barrier (RSB): a barrier that allows process groups waiting on other processes' work to actively contribute to completing it.

The next work presents an approach to accelerating task-parallel computations using GPUs in the context of the Global Arrays parallel programming model. Task parallelism is an effective technique for expressing parallelism in irregular programs. We extend the Scioto task-parallel scheduling framework to efficiently offload task execution to GPU accelerators. The execution of Scioto tasks on a GPU requires movement of data through three layers: the global address space, host memory, and device memory.
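Movement through these three layers can be sketched as a double-buffered pipeline. The following is a minimal illustrative sketch, not Scioto's actual API: the stage names (`gas_to_host`, `host_to_device`, `execute`) are assumptions, and the overlap that asynchronous transfers would provide is only modeled by the ordering of recorded events.

```python
# Illustrative sketch of three-stage, double-buffered data movement for GPU
# task offload: fetch a chunk from the global address space into host memory,
# copy it host -> device, then execute on the device. With two buffers, the
# transfers for chunk i+1 can be issued before chunk i finishes executing.

def pipeline(chunks):
    events = []
    def gas_to_host(c):    events.append(("fetch", c))  # global space -> host
    def host_to_device(c): events.append(("h2d", c))    # host -> device copy
    def execute(c):        events.append(("exec", c))   # run the task on device

    if not chunks:
        return events
    # Prime the pipeline with the first chunk.
    gas_to_host(chunks[0])
    host_to_device(chunks[0])
    for i, c in enumerate(chunks):
        # In a real runtime the next chunk's transfers are issued
        # asynchronously so they overlap execute(c); modeled sequentially here.
        if i + 1 < len(chunks):
            gas_to_host(chunks[i + 1])
            host_to_device(chunks[i + 1])
        execute(c)
    return events
```

In the recorded event order, each chunk's transfers are issued before the previous chunk's execution, which is where a real pipeline hides transfer latency behind computation.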
We propose an automated, pipeline-based approach for handling the movement of data through these memory spaces. Data transfer is made transparent to the user, providing opportunities to hide overheads through optimizations such as pipelining. On-device caching and task sequencing are also leveraged to exploit data locality.

The increase in the number of processors needed to build exascale systems implies that the mean time between failures will further decrease, making it increasingly important to develop scalable techniques for fault tolerance. We develop an efficient checksum-based approach to fault tolerance for data in volatile memory, i.e., one that does not need to save any data to stable persistent storage. The developed scheme is applicable in multiple scenarios, including: 1) online recovery of large read-only data structures from the memory of failed nodes, with very low storage overhead; 2) online recovery from soft errors in blocked data; and 3) online recovery of read/write data via in-memory checkpointing.
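The core idea behind checksum-based in-memory recovery can be illustrated with a classic XOR-parity sketch: a checksum block kept alongside N data blocks lets any single lost block be rebuilt from the survivors, without touching persistent storage. This is a simplified stand-in for the dissertation's scheme; the function names and the single-failure XOR model are illustrative assumptions.

```python
# Minimal sketch of checksum-based recovery of in-memory data blocks.
# The checksum is the XOR of all data blocks; if one block is lost (e.g.
# its node fails), XOR-ing the checksum with the surviving blocks
# reconstructs it.

def xor_blocks(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_checksum(blocks):
    """XOR of all data blocks; held on a spare node in a real system."""
    checksum = bytes(len(blocks[0]))  # all-zero block
    for blk in blocks:
        checksum = xor_blocks(checksum, blk)
    return checksum

def recover(blocks, checksum, lost_index):
    """Rebuild the block at lost_index from the survivors plus the checksum."""
    rebuilt = checksum
    for i, blk in enumerate(blocks):
        if i != lost_index:
            rebuilt = xor_blocks(rebuilt, blk)
    return rebuilt
```

The storage overhead is one extra block regardless of N, which matches the abstract's point about very low storage overhead for large read-only data structures.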
Keywords/Search Tags: Fault tolerance, Load balancing, Data, Online recovery, Memory, Systems