Research On Fault-tolerant Mechanism For SSI Cluster

Posted on: 2006-07-04
Degree: Master
Type: Thesis
Country: China
Candidate: M Wang
Full Text: PDF
GTID: 2178360185463271
Subject: Computer Science and Technology
Abstract/Summary:
Clusters have become the main architecture for high-performance computing, and single system image (SSI) clusters are now being applied to enterprise computing. However, because they are built from commercial off-the-shelf (COTS) components, clusters tend to have a short mean time between failures (MTBF), and node failures occur frequently while a cluster is running. Handling node failures efficiently is therefore important for SSI clusters, and the problem has attracted many researchers.

Our research is based on Kerrighed, a single system image cluster operating system. Kerrighed has excellent single system image properties and high performance, but it does not yet implement dynamic fault tolerance. We first study common fault-tolerance techniques and the mathematical models used for reliability evaluation. We also analyze the internal structure of Kerrighed and other typical cluster operating systems, and from this abstract a distributed service model. Since node failure is a common event that a cluster operating system must handle, we try to support fault tolerance through the internal structure of the cluster operating system itself. To that end, the paper presents the concept of a dynamic configuration management layer, which supports fault tolerance, dynamic configuration management, and node failure handling. The reliability of a maintainable fault-tolerant cluster is also analyzed with a Markov model.

On the other hand, a fault-tolerance mechanism depends on quick detection of node failures, which is usually performed by a heartbeat protocol. We present a new heartbeat protocol named Heartbeat Ring, which is suited to large-scale distributed clusters and offers low message complexity, low resource consumption, and good scalability. Finally, basic fault tolerance for Kerrighed has been implemented in a prototype, and the experimental results are presented.
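
The abstract does not describe how Heartbeat Ring works internally, so the following is only a generic sketch of a ring-shaped heartbeat scheme, not the protocol from the thesis. It assumes each node sends one heartbeat per round to its ring successor, and that the successor declares its predecessor failed after a fixed number of missed rounds and drops it from the ring. The class `HeartbeatRing`, the `timeout` parameter, and the helper `all_to_all_messages` are hypothetical names introduced for illustration.

```python
class HeartbeatRing:
    """Toy discrete-round model of ring-style failure detection (illustrative only)."""

    def __init__(self, nodes, timeout=2):
        self.ring = list(nodes)   # live membership, in ring order
        self.timeout = timeout    # assumed: missed rounds before a node is declared failed
        self.crashed = set()      # nodes that have silently stopped sending
        self.missed = {}          # node -> consecutive heartbeats its monitor has missed

    def crash(self, node):
        """Simulate a silent node failure."""
        self.crashed.add(node)

    def run_round(self):
        """Run one heartbeat round; return (messages_sent, nodes_declared_failed)."""
        messages = 0
        detected = []
        n = len(self.ring)
        for i, node in enumerate(self.ring):
            monitor = self.ring[(i + 1) % n]  # the successor watches its predecessor
            if node not in self.crashed:
                messages += 1                 # one heartbeat per live node per round
                self.missed[node] = 0
            elif monitor not in self.crashed:
                self.missed[node] = self.missed.get(node, 0) + 1
                if self.missed[node] >= self.timeout:
                    detected.append(node)
        for node in detected:
            self.ring.remove(node)            # repair the ring around the failed node
        return messages, detected


def all_to_all_messages(n):
    """Messages per round if every node sent a heartbeat to every other node."""
    return n * (n - 1)


if __name__ == "__main__":
    ring = HeartbeatRing(["n0", "n1", "n2", "n3"], timeout=2)
    print(ring.run_round())
    ring.crash("n2")
    print(ring.run_round())
    print(ring.run_round())
    print(ring.ring)
```

In this simplified model a round costs at most one message per node, whereas an all-to-all heartbeat scheme would cost n(n-1) messages per round, which is the kind of message-complexity saving a ring topology is usually chosen for. The trade-off in this sketch is that two adjacent failures are detected one after the other, since the second failed node is only noticed once the ring has been repaired around the first.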
Keywords/Search Tags: Cluster, Distributed System, Single System Image, Fault-tolerant, Heartbeat