A VM-level Fault-Tolerant System For Virtual Clusters With Coordinated Checkpointing

Posted on:2011-10-17

Degree:Master

Type:Thesis

Country:China

Candidate:M J Zhang

GTID:2178330338486101

Subject:Computer system architecture

Abstract/Summary:

As a virtual cluster grows, the probability that some of its components will fail also rises. Fault tolerance is therefore essential to the availability, reliability and manageability of the virtual cluster. VirtCFT is an innovative and practical fault tolerance system for virtual clusters, implemented at the virtual machine level with coordinated distributed checkpointing.

VirtCFT periodically coordinates the distributed VMs to reach a globally consistent state and takes a checkpoint of the whole virtual cluster, including the CPU, memory and disk state of each VM as well as the network communications. When a fault occurs, VirtCFT automatically recovers the entire virtual cluster to a correct state within a few seconds and keeps it running. Compared with existing fault tolerance mechanisms, VirtCFT provides a simpler and fully transparent fault-tolerant platform that protects existing, unmodified software and operating systems from the failure of the physical machine on which they run. Building on memory and I/O virtualization, VirtCFT improves performance and preserves transparency by adopting incremental checkpoints and centralized control of the virtual machines' networks beneath the target virtual cluster.

VirtCFT has been implemented on top of the Xen virtualization technology. The user-level daemons are written in Python and the kernel modules in C. Functional tests indicate that VirtCFT provides all of the functions above and can offer fault tolerance to arbitrary virtual clusters and applications. Performance tests show that, for a compute-intensive benchmark, it introduces no more than 30% run-time overhead compared with fault tolerance without coordination, and that the error recovery time ranges from 4.51 seconds to 5.46 seconds.
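The coordinated, incremental checkpoint cycle summarized in the abstract can be illustrated with a minimal sketch. It is not VirtCFT's code: `SimVM` and `Coordinator` are hypothetical names, VM state is reduced to an in-memory dictionary of pages, and Xen, disk state and network traffic are not modeled. The sketch only assumes the pattern the abstract describes: pause every VM, save what changed since the last checkpoint, commit for all VMs together, and roll the whole cluster back on a fault.

```python
from dataclasses import dataclass, field


@dataclass
class SimVM:
    """Hypothetical in-memory stand-in for one VM of the virtual cluster."""
    name: str
    pages: dict = field(default_factory=dict)  # page number -> contents
    dirty: set = field(default_factory=set)    # pages written since last checkpoint
    paused: bool = False

    def write(self, page, value):
        if self.paused:
            raise RuntimeError(f"{self.name} is paused")
        self.pages[page] = value
        self.dirty.add(page)


class Coordinator:
    """Checkpoints all VMs at the same point so the saved state is consistent."""

    def __init__(self, vms):
        self.vms = list(vms)
        # Last committed image of every VM: name -> {page: contents}
        self.committed = {vm.name: {} for vm in self.vms}

    def checkpoint(self):
        # Phase 1: pause every VM so no state changes during the checkpoint.
        for vm in self.vms:
            vm.paused = True
        try:
            # Phase 2: stage only the pages dirtied since the last checkpoint.
            staged = {vm.name: {p: vm.pages[p] for p in vm.dirty}
                      for vm in self.vms}
            # Phase 3: commit all increments together, then reset dirty tracking.
            for vm in self.vms:
                self.committed[vm.name].update(staged[vm.name])
                vm.dirty.clear()
            # Report how many pages each VM contributed to this increment.
            return {name: len(delta) for name, delta in staged.items()}
        finally:
            # Resume the cluster whether or not the checkpoint succeeded.
            for vm in self.vms:
                vm.paused = False

    def recover(self):
        # Roll back every VM, not only the failed one, to the last common checkpoint.
        for vm in self.vms:
            vm.pages = dict(self.committed[vm.name])
            vm.dirty.clear()
            vm.paused = False


if __name__ == "__main__":
    a, b = SimVM("vm-a"), SimVM("vm-b")
    coord = Coordinator([a, b])
    a.write(0, "x")
    b.write(0, "y")
    print(coord.checkpoint())   # pages saved per VM in this increment
    a.write(0, "corrupted")     # work done after the checkpoint is lost on recovery
    coord.recover()
    print(a.pages, b.pages)
```

Rolling back all VMs together, rather than only the failed one, is what keeps the restored cluster in a mutually consistent state; the dirty-page set is a simplified stand-in for the incremental checkpoints the abstract mentions.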

Keywords/Search Tags:

Fault tolerance, Virtual cluster, Coordinated checkpoint, High availability

Related items

1	A VM-level Fault-Tolerant System For Virtual Clusters With Coordinated Checkpointing
2	Research On Rollback Recovery Fault-Tolerance Technology In High Availability Cluster
3	Research And Implementation Of Key Technologies On High-Availability Cluster System
4	Research On Checkpoint Technique Based On Cluster State
5	Research And Implementation Of The Checkpoint Technology In High Availability System
6	Research And Implementation Of PVM-based Cluster Fault-tolerance Method
7	Research On Checkpoint Subsystem For Linux SSI Cluster
8	Research Of Process Migration Mechanism Based On Checkpoint In Computational Grid
9	Research On Adaption Method Of Cloud Fault Tolerance Services Based On User Requirement And Resource Constriction
10	Research On Implementation Technologies Of Checkpoint System And Optimization Of Performance