
Proactive fault management in operational software systems

Posted on: 1998-01-19
Degree: Ph.D.
Type: Thesis
University: Duke University
Candidate: Garg, Sachin
Full Text: PDF
GTID: 2468390014475588
Subject: Computer Science
Abstract/Summary:
This dissertation deals with proactive fault management, a technique for software fault tolerance, in operational software systems. The technique is effective because real-life software systems experience aging while in operation. Software aging refers to the degradation of the operational environment of a software system, which results in a gradual decrease in performance, occasional crash/hang failures, or both. Examples of aging can be found not only in software used on a mass scale but also in specialized software used in high-availability and safety-critical applications. Proactive fault management essentially involves occasionally performing "cleanup" operations on the running software. The cleanup counteracts the aging phenomenon by removing accrued errors. Specific forms of this technique already exist under different names, such as "Software Rejuvenation", "Software Capacity Restoration", and "Operational Redundancy". What all of these forms, and proactive fault management in general, have in common is that they incur an overhead, because the software typically needs to be stopped. Therefore, an important research issue is to determine when and how often it should be performed.

The evaluation of the effectiveness of proactive fault management in operational software systems forms the core of this dissertation. We develop stochastic models for software systems that essentially trade off the cost of unexpected failures due to aging against the overhead of proactive fault management. We also propose and analyze two practical policies for proactive fault management. In each of the modeling exercises, the issue of optimal times to initiate the cleanup is addressed. Furthermore, the usefulness of proactive fault management is evaluated for software intended to run forever as well as for software that has a finite failure-free completion time.

The other major contribution of this dissertation is the statistical validation of the hypothesis that aging exists in general-purpose UNIX machines. We describe the design and implementation of a distributed monitoring tool used to collect operating system resource usage and system activity data. We also present a methodology for estimating the age of the software system with respect to each individual resource.
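
The trade-off between the cost of unexpected failures and the overhead of cleanup can be illustrated with a small sketch. This is not one of the dissertation's models; it is a textbook age-replacement policy used here only as an assumption-laden example. It assumes a Weibull time to failure (shape greater than 1 stands in for aging), a fixed cost per unexpected failure, a smaller fixed cost per cleanup, and a cleanup every T time units if no failure has occurred. All function names, parameters, and numeric values are hypothetical.

```python
import math

def survival(t, shape, scale):
    """Probability that the software has not failed by age t (Weibull, assumed)."""
    return math.exp(-((t / scale) ** shape))

def cost_rate(T, shape, scale, cost_failure, cost_cleanup, steps=2000):
    """Long-run cost per unit time when cleanup is initiated at age T.

    A cycle ends either with an unexpected failure before T or with a
    planned cleanup at T. Downtime durations are ignored in this sketch.
    """
    # Expected cycle length = integral of the survival function over [0, T],
    # approximated with the trapezoid rule.
    h = T / steps
    area = 0.5 * (survival(0.0, shape, scale) + survival(T, shape, scale))
    area += sum(survival(i * h, shape, scale) for i in range(1, steps))
    expected_cycle = area * h

    p_fail = 1.0 - survival(T, shape, scale)
    expected_cost = cost_failure * p_fail + cost_cleanup * (1.0 - p_fail)
    return expected_cost / expected_cycle

def best_interval(shape, scale, cost_failure, cost_cleanup, candidates):
    """Pick the candidate cleanup interval with the lowest long-run cost rate."""
    return min(
        candidates,
        key=lambda T: cost_rate(T, shape, scale, cost_failure, cost_cleanup),
    )

if __name__ == "__main__":
    grid = range(1, 301)
    # Aging software (shape 2): a finite cleanup interval minimizes the cost rate.
    print(best_interval(2.0, 100.0, 10.0, 1.0, grid))
    # No aging (shape 1, exponential): the search drifts to the largest interval.
    print(best_interval(1.0, 100.0, 10.0, 1.0, grid))
```

Under these assumptions, a finite optimal interval appears only when the failure rate increases with age. With a constant failure rate, cleanup adds overhead without reducing failures, so the search pushes the interval to the largest candidate.
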
Keywords/Search Tags:Software, Proactive fault management