Automatic recovery for request oriented systems

Posted on: 2011-03-07
Degree: Ph.D
Type: Dissertation
University: University of Illinois at Urbana-Champaign
Candidate: Lenharth, Andrew David
Full Text: PDF
GTID: 1448390002969434
Subject: Computer Science
Abstract/Summary:
Gracefully recovering from software and hardware faults is important to ensuring highly reliable and available systems. Operating systems have privileged access to all aspects of system operation, so a fault in the operating system can affect the entire system. Existing approaches to operating system recovery either do not protect the entire system or require a completely new operating system design.

This dissertation presents a new approach to fault recovery in operating systems called Recovery Domains. This approach allows recovery from unanticipated faults in commodity operating systems. Recovery is organized around the concept of a dynamic request. Operating system entry points initiate requests to perform some action. System calls, for example, are a request by an application to the operating system. When a fault is detected, the recovery system rolls back the effects of the offending recovery domain while leaving the remainder of the system running. To ensure that the entire system (including the state of other concurrent kernel threads) remains consistent after the rollback, dependencies between domains are tracked as the system runs. When a faulting domain is rolled back, any other domains that depend on it because of dataflow between the domains are also rolled back and restarted.

Recovery Domains do not make faults transparent. Request failures are reported to the requester. This visibility allows handling of permanent faults, that is, faults that would recur if the request were retried. Recovery Domains also handle timing and transient faults.

Recovery Domains require compiler support to instrument the system. The necessary support is simple, but can cause unnecessarily large system overhead. This dissertation describes several performance improvements to Recovery Domains based on dynamic analysis of the system state and static analysis of memory regions, allocators, and locks. Runtime analysis of the interdependence of the active requests can reduce the tracking of state changes. The recovery compiler can reason about memory regions and data structures protected by a lock to eliminate instrumentation on many operations on locked memory. "Fresh" heap objects, those that have been allocated but have not yet become visible to other requests and threads, require no instrumentation. These improvements to the recovery runtime and compiler provide substantial performance gains over simpler implementations.

This dissertation describes the goals, approach, semantics, and programming model of Recovery Domains; the minimal implementation of the runtime and compiler; the static analysis and optimization at the compiler level and dynamic optimization of the runtime; and the porting of two significantly different versions of the Linux kernel to the recovery system. It evaluates the overhead, effectiveness, and coverage of recovery. Finally, it describes the potential integration of a model fault detector with the Recovery Domains system.
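
The rollback scheme described above (log state changes per request, track dataflow dependencies between domains, and roll back a faulting domain together with its dependents) can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the dissertation's implementation: the `RecoveryRuntime` class, its method names, and the use of a Python dict as "shared state" are all hypothetical, and the model ignores concurrency, commit ordering between dependent domains, and the compiler instrumentation that the real system relies on.

```python
class RecoveryRuntime:
    """Toy model (hypothetical): shared state is a dict, each request runs in a named domain."""

    def __init__(self):
        self.memory = {}       # shared state touched by requests
        self.log = []          # global undo log: (domain, key, had_old, old_value, prev_writer)
        self.last_writer = {}  # key -> domain whose write is currently visible
        self.dependents = {}   # active domain -> domains that consumed its writes

    def begin(self, name):
        self.dependents[name] = set()

    def _track(self, name, key):
        # Dataflow dependency: touching a value written by another
        # still-active domain makes this domain dependent on it.
        writer = self.last_writer.get(key)
        if writer is not None and writer != name and writer in self.dependents:
            self.dependents[writer].add(name)

    def read(self, name, key):
        self._track(name, key)
        return self.memory.get(key)

    def write(self, name, key, value):
        self._track(name, key)
        # Record the old value (and old writer) before overwriting it.
        self.log.append((name, key, key in self.memory,
                         self.memory.get(key), self.last_writer.get(key)))
        self.memory[key] = value
        self.last_writer[key] = name

    def commit(self, name):
        # The request finished normally: its undo entries are no longer needed.
        self.log = [e for e in self.log if e[0] != name]
        del self.dependents[name]

    def rollback(self, name):
        # Collect the faulting domain and, transitively, its active dependents.
        doomed, stack = set(), [name]
        while stack:
            d = stack.pop()
            if d in doomed or d not in self.dependents:
                continue
            doomed.add(d)
            stack.extend(self.dependents[d])
        # Undo their writes in reverse order; other domains are left untouched.
        for dom, key, had_old, old, prev_writer in reversed(self.log):
            if dom not in doomed:
                continue
            if had_old:
                self.memory[key] = old
            else:
                self.memory.pop(key, None)
            if prev_writer is None:
                self.last_writer.pop(key, None)
            else:
                self.last_writer[key] = prev_writer
        self.log = [e for e in self.log if e[0] not in doomed]
        for d in doomed:
            del self.dependents[d]
        return doomed  # caller reports the failure and restarts dependents


rt = RecoveryRuntime()
rt.memory["balance"] = 100
for request in ("A", "B", "C"):
    rt.begin(request)
rt.write("A", "balance", 50)                   # A modifies shared state
rt.write("B", "copy", rt.read("B", "balance")) # B consumes A's write
rt.write("C", "other", 1)                      # C is independent of A
rolled_back = rt.rollback("A")                 # fault detected in A
```

In this model, rolling back request `A` also rolls back `B`, which read a value `A` wrote, while the independent request `C` keeps its changes. The returned set is where a caller would report the failure of the faulting request to its requester and restart the dependents, as the abstract describes.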
Keywords/Search Tags:System, Recovery, Fault, Request