Providing application-aware reliability through OS/hypervisor-level techniques

Posted on: 2011-04-29

Degree: Ph.D.

Type: Thesis

University: University of Illinois at Urbana-Champaign

Candidate: Wang, Long

GTID: 2446390002969606

Subject: Engineering

Abstract/Summary:

Operating systems and hypervisors enable the collection and extraction of rich information on application and system execution characteristics. This thesis describes the Reliability MicroKernel (RMK) architecture, an infrastructure for designing and deploying software modules that provide application-aware error detection and recovery.

The purpose of the RMK is to provide an automatic approach to low-latency crash/hang detection and rapid checkpoint-based recovery. We first demonstrate how the RMK works in a native system and then enhance it to work in virtual machines (VMs). In a native system, the RMK is installed as a device driver; in a virtualized system, it is installed as a device driver in the VMs and also deployed as a hypercall (which is like a system call) in the hypervisor. Our approach is transparent to applications and VMs: the kernel source code of a native system or of a VM does not need to be modified or recompiled.

The implemented RMK modules include OS/application crash detection, system hang detection, and transparent checkpointing. Traditionally, an external hardware watchdog forces a system reboot whenever the watchdog is not reset within a predefined timeout interval. The detection latency can be significant because the timeout interval is usually on the order of seconds, chosen to reduce false alarms. The approach in this thesis enables low-latency OS-hang detection (within hundreds of milliseconds or less) by counting the instructions executed between two consecutive context switches and checking whether the count exceeds a predefined threshold.

The RMK is enhanced to support virtualized environments. Specifically, we present the description, implementation, and experimental assessment of VM-muCheckpoint, a VM checkpointing framework that protects both the guest OS and applications against runtime errors. Compared with existing VM checkpoint techniques, VM-muCheckpoint has low overhead and rapid recovery, handles non-fail-stop errors, and runs at high frequency (tens of checkpoints per second) to reduce the recomputation needed when recovering a VM from a failure. The key point of VM-muCheckpoint is that it takes incremental checkpoints while considering the whole memory of the protected VM as part of the checkpoint.

The RMK prototype has been implemented in both Linux and Windows systems on a Pentium 4 processor and in the Xen VMM. (The Xen hypervisor is recompiled to install the RMK, but the OS of a native system or a VM is not recompiled.)

Error injection experiments show that the RMK detects all the crashes and system hangs, and that VM-muCheckpoint successfully recovers VMs from all the crashes. Moreover, the experimental evaluation of the RMK using real-world applications shows high coverage and low false-positive rates for error detection (e.g., no false positives for system hang detection), as well as low overhead in providing checkpoint and recovery (e.g., an average of 6.3% overhead in VM-muCheckpoint for SPEC benchmark programs with 50 ms checkpoint intervals).

We also apply a formal method and analytical/probabilistic models to verify the capability of our system hang detection and to study the availability enhancement provided by the RMK.
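
The hang-detection rule described in the abstract (count the instructions executed since the last context switch and compare the count against a threshold) can be illustrated with a small sketch. This is not the RMK implementation: the class name `HangDetector`, its methods, and the threshold value are hypothetical, and the abstract does not say how the instruction count is read or how often the check runs. The counter values below are supplied by hand purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class HangDetector:
    """Illustrative only: flags a suspected hang when too many
    instructions execute without an intervening context switch."""

    threshold: int                 # assumed limit on instructions between switches
    count_at_last_switch: int = 0  # counter value recorded at the last switch

    def on_context_switch(self, instr_count: int) -> None:
        # A context switch indicates the scheduler is still making progress,
        # so the baseline is moved to the current counter value.
        self.count_at_last_switch = instr_count

    def is_hung(self, instr_count: int) -> bool:
        # Instructions executed since the last observed context switch.
        executed = instr_count - self.count_at_last_switch
        return executed > self.threshold


# Hand-picked counter values (hypothetical, not measurements).
detector = HangDetector(threshold=1_000_000)
detector.on_context_switch(5_000_000)
healthy = detector.is_hung(5_400_000)  # 400,000 instructions since the switch
hung = detector.is_hung(6_500_000)     # 1,500,000 instructions since the switch
```

In this sketch the threshold trades detection latency against false alarms, much as the timeout interval does for a hardware watchdog: a smaller threshold flags hangs sooner but risks misclassifying long-running code that legitimately executes without a context switch.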

Keywords/Search Tags:

RMK, System, Providing
