
Operating System Support for Resilience

Posted on: 2012-03-01
Degree: Ph.D.
Type: Thesis
University: Dartmouth College
Candidate: McGill, Kathleen
Full Text: PDF
GTID: 2458390011955400
Subject: Computer Engineering
Abstract/Summary:
The notion of resiliency is concerned with constructing applications that are able to operate through a wide variety of computer failures and attacks. Several approaches have been proposed to provide fault tolerance through the replication of resources. In general, these approaches provide graceful degradation of performance to the point of failure but do not guarantee progress in the presence of multiple cascading and recurrent failures. The proposed approach dynamically replicates processes, detects inconsistencies in their behavior, and restores the level of resiliency as the computation proceeds, so that failures have no long-term effect on applications. This thesis introduces a collection of novel operating system technologies that provide applications with automated, transparent, and scalable resilience. Resiliency mechanisms and policies are explored in a resilient message-passing technology, rMP.

A Linux rMP prototype implements a message-passing API through kernel-level communication that provides the underlying resiliency mechanisms: process replication, adaptive failure detection, and dynamic process regeneration. An innovative approach to adaptive failure detection uses locality within replicated processes as a basis to detect anomalies in message delay during group communication. Replication and migration mechanisms provide transparent regeneration of message-passing processes without halting the application or executing global coordination protocols.

Resiliency policies are explored through evaluation of alternative algorithms for distributed process management. A new algorithm, DIFFUSE, is introduced, inspired by the notions of heat diffusion and robotic swarming. Heat diffusion is emulated to disseminate processes across a scalable multicomputer architecture. Robotic swarming techniques are used to maintain locality between resilient processes while balancing load. The algorithm's performance is compared to competing algorithms using a set of benchmarks that capture the primary attributes of the process management problem. DIFFUSE outperforms competing algorithms and integrates the goals of load-balancing and resilience within a single strategy.
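
The idea of emulating heat diffusion to spread load can be illustrated with a generic first-order diffusion scheme. The sketch below is not the DIFFUSE algorithm from the thesis, whose details the abstract does not give; it only shows the textbook technique the abstract alludes to. The function names, the four-node ring topology, and the coefficient `alpha` are all assumptions made for illustration, and load is treated as a continuous quantity rather than as a count of discrete processes.

```python
from typing import Dict, List


def diffuse_step(load: List[float],
                 neighbors: Dict[int, List[int]],
                 alpha: float) -> List[float]:
    """One synchronous diffusion step on an undirected graph.

    Each node moves alpha * (own load - neighbor load) toward each
    neighbor, so load flows from heavier nodes to lighter ones.
    The neighbor map must be symmetric for total load to be conserved.
    """
    new_load = list(load)
    for i, nbrs in neighbors.items():
        for j in nbrs:
            new_load[i] -= alpha * (load[i] - load[j])
    return new_load


def balance(load: List[float],
            neighbors: Dict[int, List[int]],
            alpha: float = 0.25,
            steps: int = 100) -> List[float]:
    """Repeat the diffusion step a fixed number of times.

    alpha should stay below 1 / (maximum node degree) to keep the
    iteration stable; 0.25 is safe for the degree-2 ring used here.
    """
    for _ in range(steps):
        load = diffuse_step(load, neighbors, alpha)
    return load


# Hypothetical topology: four nodes connected in a ring.
RING = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

if __name__ == "__main__":
    # All load starts on node 0 and spreads toward an even split.
    print(balance([8.0, 0.0, 0.0, 0.0], RING))
```

Each node needs only its neighbors' loads, so no global coordination is required, which is the property that makes diffusion attractive on a scalable multicomputer. A real process manager would additionally have to move whole processes and, as the abstract notes, keep replicas of the same process close together.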
Keywords/Search Tags: Resiliency, Process