Design And Implementation Of Cluster Fault-tolerant System

Posted on: 2009-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: X Li
Full Text: PDF
GTID: 2178360272470527
Subject: Computer application technology
Abstract/Summary:
In research on high-performance computing, the first problem is how to keep the computer reliable and available. The cluster is the mainstream architecture for high-performance computing because of its low cost and good scalability. The loosely coupled architecture between nodes makes high availability easier to implement in a cluster than in a centralized system. But as clusters grow larger, new problems arise. The purpose of this thesis is to increase the availability of the cluster. The thesis proposes a cluster fault-tolerant system made up of four modules: a user module, a center module, a process control module and a heartbeat module. The four modules cooperate to provide the system's functionality, and each is described in detail in the thesis. The system organizes its nodes in a loosely coupled structure. It can heal itself and keep running as long as possible, avoiding service interruptions caused by failures. The system is highly extensible: any node can join or leave the cooperation relationship at any time. It provides two levels of fault tolerance. The heartbeat mechanism is the most common technique for achieving reliable communication in a high-availability system. To detect the failure of a computing node quickly and accurately, the thesis designs a new real-time heartbeat that can be dynamically linked into the Linux kernel. It avoids the influence of process scheduling and detects node failure with a shorter delay than a user-mode implementation. The thesis uses the netlink connector to detect process failure. The exit of a process is treated as abnormal unless the process was not being monitored. When the heartbeat detection protocol notices that a node has failed, the remaining nodes run a distributed selection algorithm to pick the agent that takes full responsibility for the failover. The agent restarts the process to maintain the availability of the system.
The availability and robustness of the system are improved to a certain extent.
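
The detect-then-select flow described in the abstract can be illustrated with a small sketch. This is not the thesis implementation: the thesis runs its heartbeat inside the Linux kernel, and the abstract does not say which selection rule it uses. The sketch below is user-level Python for the logic only. The names `HeartbeatMonitor`, `select_agent` and `handle_failures`, the timeout value, and the "lowest surviving node ID becomes the agent" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node and flags nodes that go silent."""
    timeout: float  # seconds of silence before a node is presumed failed (assumed value)
    last_seen: dict = field(default_factory=dict)

    def beat(self, node_id: int, now: float) -> None:
        # Record that a heartbeat from node_id arrived at time `now`.
        self.last_seen[node_id] = now

    def failed_nodes(self, now: float) -> set:
        # A node is presumed failed once its silence exceeds the timeout.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}

    def live_nodes(self, now: float) -> set:
        # Every known node that is not currently presumed failed.
        return set(self.last_seen) - self.failed_nodes(now)


def select_agent(live: set) -> int:
    """Pick the failover agent among the surviving nodes.

    Assumption: lowest node ID wins. Every survivor that sees the same
    live set reaches the same answer without extra messages.
    """
    if not live:
        raise ValueError("no live nodes left to act as agent")
    return min(live)


def handle_failures(monitor: HeartbeatMonitor, now: float, restart):
    """If any node has failed, select an agent and let it restart the work.

    `restart(agent, failed_node)` is a hypothetical callback standing in
    for the process-restart step. Returns the agent ID, or None if no
    node has failed.
    """
    failed = monitor.failed_nodes(now)
    if not failed:
        return None
    agent = select_agent(monitor.live_nodes(now))
    for node in sorted(failed):
        restart(agent, node)
    return agent
```

Timestamps are passed in explicitly rather than read from a clock, so the logic can be exercised deterministically. In a real deployment the timeout would have to be tuned against network jitter and scheduling delay, which is the delay the kernel-level heartbeat in the thesis is designed to reduce.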
Keywords/Search Tags: High Availability, Heartbeat Detection, Fault-tolerant