Application-transparent fault management

Posted on: 1995-01-17

Degree: Ph.D.

Type: Dissertation

University: Carnegie Mellon University

Candidate: Russinovich, Mark Eugene

GTID: 1478390014490949

Subject: Engineering

Abstract/Summary:

As computers continue to proliferate and are used in more demanding environments, data integrity and continuous availability are increasingly important aspects of their design. Since operating systems are common to all computers, and the operating system level is where there is maximum system visibility and control, it is appropriate for the operating system to provide policies that detect, contain and tolerate faults. These policies and the mechanisms that support them form an operating system's "fault management." A fault management mechanism, the sentry mechanism, has been designed and implemented for a UNIX 4.3 BSD server running on the Mach 3.0 microkernel. Fault-tolerant policies have been designed for a range of computer systems, from a single computer to mirrored computers to distributed systems. The policies addressed first provide single-computer applications with application-transparent fault tolerance with respect to transient faults and certain types of permanent faults. Contributions to this area include algorithms for concurrent process journaling, disk checkpointing and memory checkpointing. Formal proofs are given for the journal sequencing algorithm and the disk checkpointing algorithm. Performance measurements from an implementation of the single-computer algorithms show an average performance overhead of less than 5% and a requirement of only 10 MB of dedicated stable disk storage. The system provides fault tolerance with no additional hardware other than a hard disk, and works with unmodified applications such as the X Window System. Sentry policies that provide software-based fault tolerance for duplicated and triplicated computer systems, as well as distributed systems, have also been designed. Contributions related to these policies include mirrored system synchronization, fault detection and integration algorithms.
In addition, a new n-fault-tolerant distributed recovery algorithm is presented that is based on loosely synchronized checkpointing. The algorithm journals message order information rather than the message contents journaled by existing algorithms. Saving only the order of messages can potentially result in lower space and time overheads. Two variants of the algorithm are presented and formally proven correct. In all three system designs, the sentry mechanism provides sufficient control for the fault-tolerant policies.
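
The difference between order-based and content-based journaling can be illustrated with a small sketch. This is not the dissertation's algorithm: the class and function names (`OrderJournal`, `Receiver`, `replay`) are hypothetical, the checkpoint is simplified to an initial state of 0, and senders are assumed to be deterministic so that they regenerate the same messages when re-executed. The journal stores only a (sender, sequence number) pair per delivery, and recovery uses that order to re-deliver regenerated messages.

```python
from dataclasses import dataclass, field


@dataclass
class OrderJournal:
    """Records only (sender, sequence number) for each delivered message."""
    entries: list = field(default_factory=list)

    def record(self, sender, seq):
        self.entries.append((sender, seq))


class Receiver:
    """A deterministic process whose state depends on delivery order."""

    def __init__(self):
        self.state = 0  # stands in for the last checkpointed state
        self.journal = OrderJournal()

    def deliver(self, sender, seq, payload, log=True):
        if log:
            self.journal.record(sender, seq)  # the payload is not saved
        self.state = self.state * 31 + payload  # order-sensitive update


def replay(journal, regenerated):
    """Rebuild receiver state from the checkpoint using the journaled order.

    `regenerated` maps (sender, seq) -> payload and stands in for messages
    that re-executing senders produce again, possibly in a different order.
    """
    recovered = Receiver()
    for sender, seq in journal.entries:
        recovered.deliver(sender, seq, regenerated[(sender, seq)], log=False)
    return recovered


# Failure-free run: three messages arrive in some interleaving.
original = Receiver()
original.deliver("A", 1, 5)
original.deliver("B", 1, 7)
original.deliver("A", 2, 11)

# Recovery: senders regenerate the same messages in a different order.
regenerated = {("A", 2): 11, ("B", 1): 7, ("A", 1): 5}
recovered = replay(original.journal, regenerated)
```

In this sketch the journal grows by one small tuple per message regardless of payload size, which is the intuition behind the potential space savings. The cost is that recovery depends on the senders being able to reproduce the message contents.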

Keywords/Search Tags:

Fault, Policies, System, Computer

Related items

1	Research On Techniques Of Dissimilar Fault Tolerant Computer
2	Research On The U.S. High-performance Computer Export Control Policies (1993-2001)
3	The Design And Implementation Of TMR Fault Tolerant Computer
4	Research On Fault Diagnosis Of Computer Hardware System Based On Bayesian Network
5	Two-mode Fault-tolerant Computer Systems Research And Design
6	Research On Fault-Tolerant Technology Of Cube-Star Spaceborne Computer System
7	Study And Realization Of Fault-tolerant Technology For Parallel Computer On Satellite
8	Research On The Policies And Regulations Of Sci-tech Periodicals Under The Background Of Into-enterprise System Reform In China
9	Design And Implementation Of Fault Injectors For High-End Fault-Tolerant Computer
10	Design And Implementation Of Fault Log Analysis System For High-Performance Fault-Tolerant Computer