Practical on-line diagnosis in distributed system

Posted on: 1995-08-15
Degree: Ph.D
Type: Thesis
University: Carnegie Mellon University
Candidate: Buskens, Richard Wayne
GTID: 2472390014990321
Subject: Electrical engineering
Abstract/Summary:
Tasks in a distributed system rely on a greater number of system components for completion than tasks in a uniprocessor, typically to increase performance. As the number of needed components increases, the likelihood that a single failure jeopardizes successful task operation also increases, unless the effects of the failure can be isolated. Fault tolerance is introduced into a distributed system to provide the necessary isolation. Historically, fault tolerance solutions for distributed systems rely on fault masking, a high-cost technique that requires preallocated redundant resources even if no faulty components are present. An alternative to fault masking is fault detection, diagnosis, and recovery, where redundant resources inherent in a distributed system are dynamically allocated only when faulty components are present. This technique achieves fault tolerance at relatively low cost. This thesis examines the fault diagnosis problem for distributed systems, known as the distributed system-level diagnosis problem.

New distributed system-level diagnosis algorithms suitable for implementation in real systems are presented. The algorithms are theoretically rigorous, make practical assumptions about the operating environment, and execute without disrupting normal system operation. The Adaptive DSD distributed diagnosis algorithm assumes restrictive faulty component behavior and requires the provably minimum operational overhead of any on-line distributed diagnosis algorithm. The Robust distributed diagnosis algorithm operates correctly in the presence of arbitrary failures, but requires higher overhead. To verify the practicality of the modelling assumptions, Adaptive DSD was implemented in a local area network of UNIX workstations.

The results of this thesis demonstrate that distributed system-level diagnosis theory is practical for application to real distributed systems. The use of fault detection, diagnosis, and recovery techniques is expected to play a significant role in the construction of fault-tolerant distributed systems.
Keywords/Search Tags:Distributed, Diagnosis, Fault, Practical, Components