Application-transparent fault management

Posted on: 1995-01-17
Degree: Ph.D.
Type: Dissertation
University: Carnegie Mellon University
Candidate: Russinovich, Mark Eugene
Full Text: PDF
GTID: 1478390014490949
Subject: Engineering
Abstract/Summary:
As computers continue to proliferate and are used in more demanding environments, data integrity and continuous availability become increasingly important aspects of their design. Because operating systems are common to all computers, and the operating system level offers maximum system visibility and control, it is appropriate for the operating system to provide policies that detect, contain, and tolerate faults. These policies and the mechanisms that support them form an operating system's "fault management." A fault management mechanism, the sentry mechanism, has been designed and implemented for a UNIX 4.3 BSD server running on the Mach 3.0 microkernel. Fault-tolerant policies have been designed for a range of computer systems, from a single computer to mirrored computers to distributed systems. The policies addressed first provide single-computer applications with application-transparent fault tolerance with respect to transient faults and certain types of permanent faults. Contributions to this area include algorithms for concurrent process journaling, disk checkpointing, and memory checkpointing. Formal proofs are given for the journal sequencing algorithm and the disk checkpointing algorithm. Performance measurements from an implementation of the single-computer algorithms show an average performance overhead of less than 5% and a requirement of only 10 MB of dedicated stable storage on disk. The system provides fault tolerance with no additional hardware other than a hard disk, and works with unmodified applications such as the X Window System. Sentry policies that provide software-based fault tolerance for duplicated and triplicated computer systems, as well as distributed systems, have also been designed. Contributions related to these policies include mirrored system synchronization, fault detection and integration algorithms.
In addition, a new n-fault-tolerant distributed recovery algorithm is presented that is based on loosely synchronized checkpointing. The algorithm journals message order information instead of the message content journaled by existing algorithms. Saving only the order of messages can potentially lower space and time overheads. Two variants of the algorithm are presented and formally proven. In all three system designs, the sentry mechanism provides sufficient control for the fault-tolerant policies.
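
The contrast between journaling message order and journaling message content can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the class names, the (sender, sequence number) message identifier, and the assumption that senders regenerate payloads by re-executing from their own checkpoints are all hypothetical, chosen only to show why an order-only journal can be smaller than a content journal.

```python
class OrderJournal:
    """Hypothetical journal that stores delivery order only, never payloads."""

    def __init__(self):
        self.entries = []  # list of (sender, seq) in delivery order

    def record(self, sender, seq):
        self.entries.append((sender, seq))


class Receiver:
    """Hypothetical receiver that journals the order of delivered messages."""

    def __init__(self):
        self.journal = OrderJournal()
        self.delivered = []  # payloads as seen by the application

    def deliver(self, sender, seq, payload):
        # Only the message identity is journaled; the payload is not saved.
        self.journal.record(sender, seq)
        self.delivered.append(payload)


def replay(journal, resent):
    """Re-deliver regenerated messages in the journaled order.

    `resent` maps (sender, seq) to a payload. This sketch assumes senders
    can regenerate those payloads during recovery.
    """
    return [resent[key] for key in journal.entries]


# Usage: messages arrive interleaved from two senders.
receiver = Receiver()
receiver.deliver("B", 1, "b1")
receiver.deliver("A", 1, "a1")
receiver.deliver("B", 2, "b2")

# During recovery the regenerated messages may be available in any order;
# the journal restores the original delivery sequence.
resent = {("A", 1): "a1", ("B", 1): "b1", ("B", 2): "b2"}
recovered = replay(receiver.journal, resent)
```

In this sketch each journal entry is a fixed-size identifier regardless of payload size, which is the intuition behind the potential space savings; the cost is that recovery depends on the senders being able to reproduce the message contents.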
Keywords/Search Tags:Fault, Policies, System, Computer