Providing application-aware reliability through OS/hypervisor-level techniques

Posted on: 2011-04-29
Degree: Ph.D
Type: Thesis
University: University of Illinois at Urbana-Champaign
Candidate: Wang, Long
Full Text: PDF
GTID: 2446390002969606
Subject: Engineering
Abstract/Summary:
Operating systems and hypervisors enable the collection and extraction of rich information on application and system execution characteristics. This thesis describes a Reliability MicroKernel (RMK) architecture, an infrastructure for designing and deploying software modules that provide application-aware error detection and recovery.

The purpose of the RMK is to provide an automatic approach to low-latency crash/hang detection and rapid recovery via checkpointing. We first demonstrate how the RMK works in a native system and then enhance it to work in VMs. In a native system, the RMK is installed as a device driver; in a virtualized system, it is both installed as a device driver in the VMs and deployed as a hypercall (the hypervisor analogue of a system call) in the hypervisor. Our approach is transparent to applications and VMs: neither a native system nor a VM requires its kernel source code to be modified or recompiled.

The implemented RMK modules include OS/application crash detection, system hang detection, and transparent checkpointing. Traditionally, an external hardware watchdog forces a system reboot whenever the watchdog is not reset within a predefined timeout interval. The detection latency can be significant because, to reduce false alarms, the timeout interval for resetting the watchdog timer is usually a matter of seconds. The approach in this thesis enables low-latency OS-hang detection (within hundreds of milliseconds or less) by measuring the number of instructions executed between two consecutive context switches and checking whether that count exceeds a predefined threshold.

The RMK is enhanced to support virtualized environments. Specifically, we present the description, implementation, and experimental assessment of VM-muCheckpoint, a VM checkpointing framework that protects both the guest OS and applications against runtime errors. Compared with existing VM checkpoint techniques, VM-muCheckpoint has small overhead and rapid recovery, handles non-fail-stop errors, and runs at high frequency (tens of checkpoints per second) to reduce the recomputation needed when recovering a VM from a failure. The key idea of VM-muCheckpoint is to take incremental checkpoints while considering the whole memory of the protected VM as part of the checkpoint.

The RMK prototype has been implemented in both Linux and Windows systems on a Pentium 4 processor and in the Xen VMM. (The Xen hypervisor is recompiled to install the RMK, but the OS of a native system or a VM is not recompiled.)

Error injection experiments show that the RMK detects all the crashes and system hangs, and that VM-muCheckpoint successfully recovers VMs from all the crashes. Moreover, the experimental evaluation of the RMK using real-world applications shows high coverage and low false-positive rates for error detection (e.g., no false positives for system hang detection), as well as low overhead in providing checkpoint and recovery (e.g., an average of 6.3% overhead in VM-muCheckpoint for SPEC benchmark programs with 50 ms checkpoint intervals).

We also apply a formal method and analytical/probabilistic models to verify the capability of our system hang detection and to study the availability enhancement provided by the RMK.
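
The hang-detection rule described in the abstract (count the instructions executed since the last context switch and flag a hang once the count exceeds a threshold) can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions, not the RMK implementation: the abstract does not say how the count is obtained or how the threshold is chosen, so the class `HangDetector`, the function `run_trace`, the event names, and the threshold of 1000 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HangDetector:
    """Toy model of the rule: too many instructions without a context switch."""
    threshold: int              # hypothetical instruction budget between switches
    count: int = 0              # instructions executed since the last context switch
    hang_detected: bool = False

    def on_instructions_executed(self, n: int) -> None:
        # Model the instruction count advancing by n.
        self.count += n
        if self.count > self.threshold:
            self.hang_detected = True

    def on_context_switch(self) -> None:
        # A context switch shows the scheduler is still making progress,
        # so the count restarts from zero.
        self.count = 0


def run_trace(detector, trace):
    """Replay (kind, n) events; return the index of the event at which a hang
    is first flagged, or None if no hang is flagged."""
    for i, (kind, n) in enumerate(trace):
        if kind == "run":
            detector.on_instructions_executed(n)
        elif kind == "switch":
            detector.on_context_switch()
        if detector.hang_detected:
            return i
    return None


# Hypothetical traces: a healthy system switches regularly; a hung one stops.
healthy = [("run", 400), ("switch", 0), ("run", 900), ("switch", 0)]
hung = [("run", 400), ("switch", 0), ("run", 600), ("run", 600)]

print(run_trace(HangDetector(threshold=1000), healthy))
print(run_trace(HangDetector(threshold=1000), hung))
```

In this model, detection latency is bounded by the time needed to execute `threshold` instructions rather than by a multi-second watchdog timeout, which is the trade-off the abstract points to.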
Keywords/Search Tags:RMK, System, Providing