A VM-level Fault-Tolerant System For Virtual Clusters With Coordinated Checkpointing

Posted on: 2011-10-17
Degree: Master
Type: Thesis
Country: China
Candidate: M J Zhang
Full Text: PDF
GTID: 2178330338486101
Subject: Computer system architecture
Abstract/Summary:
As a virtual cluster grows, the probability that some of its components will fail also rises. Fault tolerance is therefore essential to the availability, reliability and manageability of a virtual cluster. VirtCFT is an innovative and practical fault-tolerance system for virtual clusters, implemented at the virtual machine level with coordinated distributed checkpointing.

VirtCFT periodically coordinates the distributed VMs to reach a globally consistent state and takes a checkpoint of the whole virtual cluster, including the CPU, memory and disk state of each VM as well as the network communications. When a fault occurs, VirtCFT automatically recovers the entire virtual cluster to a correct state within a few seconds and keeps it running. Compared with existing fault-tolerance mechanisms, VirtCFT provides a simpler and fully transparent fault-tolerant platform that protects existing, unmodified software and operating systems from the failure of the physical machine on which they run. In addition, using memory and I/O virtualization, VirtCFT improves performance and provides transparency by adopting incremental checkpoints and centralized control of the virtual machines' networks underneath the target virtual cluster.

VirtCFT has been implemented on top of the Xen virtualization technology. The user-level daemons are written in Python, and the kernel modules are written in C. Functional testing indicates that VirtCFT provides all of the functions above and can provide fault tolerance to arbitrary virtual clusters and applications. Performance testing shows that, for a compute-intensive benchmark, VirtCFT introduces no more than 30% run-time overhead compared with fault tolerance without coordination. The error recovery time ranges from 4.51 seconds to 5.46 seconds.
Keywords/Search Tags: Fault tolerance, Virtual cluster, Coordinated checkpoint, High availability
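
The coordinated, incremental checkpointing described in the abstract can be illustrated with a minimal toy sketch. It is not VirtCFT's code: the `VM` and `Coordinator` classes, their methods, and the page-dictionary memory model are all hypothetical, and the sketch ignores CPU, disk and in-flight network state, which the real system also captures. It only shows the general pattern of pausing every VM, saving the pages dirtied since the last checkpoint, resuming, and rolling the whole cluster back together on failure.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    """Hypothetical toy VM: memory is a dict mapping page number to contents."""
    name: str
    memory: dict = field(default_factory=dict)
    dirty: set = field(default_factory=set)   # pages written since last checkpoint
    paused: bool = False

    def write(self, page, value):
        if self.paused:
            raise RuntimeError(f"{self.name} is paused")
        self.memory[page] = value
        self.dirty.add(page)


class Coordinator:
    """Hypothetical coordinator for a cluster-wide checkpoint (assumed design)."""

    def __init__(self, vms):
        self.vms = vms
        # Last committed checkpoint image for each VM.
        self.images = {vm.name: {} for vm in vms}

    def checkpoint(self):
        # Phase 1: pause every VM so that none runs past the checkpoint cut.
        for vm in self.vms:
            vm.paused = True
        # Phase 2: incremental save, copying only pages dirtied since last time.
        saved = {}
        for vm in self.vms:
            delta = {page: vm.memory[page] for page in vm.dirty}
            self.images[vm.name].update(delta)
            saved[vm.name] = len(delta)
            vm.dirty.clear()
        # Phase 3: resume the whole cluster.
        for vm in self.vms:
            vm.paused = False
        return saved  # number of pages saved per VM

    def recover(self):
        # Roll every VM back to the same checkpoint, not just the failed one.
        for vm in self.vms:
            vm.memory = dict(self.images[vm.name])
            vm.dirty.clear()
            vm.paused = False
```

In this sketch, a second call to `checkpoint()` copies only the pages written after the first call, which is the idea behind the incremental checkpoints mentioned above. Rolling back all VMs together in `recover()` is what keeps the restored cluster state globally consistent.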