Design And Implementation Of Multi-machine Fault-tolerant System On Linux

Posted on: 2008-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Zhang
GTID: 2178360242967552
Subject: Computer application technology
Abstract/Summary:
With the pervasive application of computer technology and the Internet, people increasingly rely on computer systems. Some critical business systems demand highly available computer systems to ensure the continuity of their applications, so an application system needs the ability to heal itself and keep running as long as possible, avoiding service interruptions whatever their cause. For small-scale applications, dual-machine fault-tolerant technology is commonly used; such a system tolerates faults well at a small investment. As transaction volume grows, however, applications demand more computing power than a dual-machine system can provide, and dual-machine systems give way to multi-machine systems with better scalability. This research area has therefore attracted wide interest and more effort than ever before. For example, the community-developed open source projects LVS and LinuxHA are widely used in industry.
Against this background, this paper proposes a multi-machine fault-tolerant system that runs on Linux. Through the cooperation of all the machines, it provides two levels of fault tolerance, protecting both application processes and computing nodes. If a server process exits abnormally, the system notices it and cooperates with the other machines to take over the failed service. Likewise, when the heartbeat detection protocol notices that a node has failed, the remaining nodes run a distributed election algorithm to pick the agent that takes full charge of the failover and tries to make the service available again as soon as possible.
The system organizes its nodes in a loosely coupled structure, so it scales well: any node can join or leave the cooperative group at any time.
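
The abstract does not say how the distributed election chooses its agent. As a rough illustration only, the sketch below assumes each surviving node holds the same membership view and picks the least-loaded survivor, breaking ties by the lowest node ID. The `Node` type, the `load` field, and the `elect_agent` function are hypothetical names introduced for this example and are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Node:
    """Hypothetical cluster member; both fields are assumptions for this sketch."""
    node_id: int
    load: float  # assumed load metric, lower is better


def elect_agent(members: Iterable[Node], failed_id: int) -> Node:
    """Pick the node that takes over the failed node's services.

    Every survivor runs this same function on the same membership view,
    so all of them reach the same answer without a coordinator.
    """
    survivors = [n for n in members if n.node_id != failed_id]
    if not survivors:
        raise RuntimeError("no surviving node can take over the service")
    # Least-loaded node wins; the node ID breaks ties deterministically.
    return min(survivors, key=lambda n: (n.load, n.node_id))


# Example: node 2 has failed; nodes 1 and 3 each run the election locally.
cluster = [Node(1, 0.7), Node(2, 0.2), Node(3, 0.2)]
agent = elect_agent(cluster, failed_id=2)
```

Because the rule is deterministic, the survivors agree on the agent only as long as their membership views match; a real system would also need to handle views that diverge, which this sketch does not attempt.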
To detect the failure of a computing node quickly and accurately, this paper designs and implements a dedicated heartbeat detection protocol that works in the kernel. Because the protocol runs as a network protocol entity, it avoids the process scheduling delays that affect application processes, and so it can detect node failure with a shorter delay than an implementation in user mode.
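
The timeout logic behind heartbeat detection can be sketched in a few lines. This is a user-space illustration of the general idea, not the kernel-level protocol the thesis implements, and the `HeartbeatMonitor` class, the one-second interval, and the three-missed-beats limit are all assumptions made for the example.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that go silent."""

    def __init__(self, interval: float = 1.0, missed_limit: int = 3) -> None:
        self.interval = interval          # assumed seconds between heartbeats
        self.missed_limit = missed_limit  # assumed number of missed beats tolerated
        self.last_seen: dict[str, float] = {}

    def record(self, node_id: str, now: float) -> None:
        """Note that a heartbeat from node_id arrived at time `now`."""
        self.last_seen[node_id] = now

    def failed_nodes(self, now: float) -> list[str]:
        """Return nodes silent for longer than interval * missed_limit."""
        deadline = self.interval * self.missed_limit
        return sorted(
            node for node, seen in self.last_seen.items() if now - seen > deadline
        )


# Example with explicit timestamps so the behaviour is reproducible.
monitor = HeartbeatMonitor(interval=1.0, missed_limit=3)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=3.0)   # node-a keeps sending, node-b goes silent
suspects = monitor.failed_nodes(now=3.5)
```

In a user-space version like this one, the checking code itself can be delayed by the scheduler, which stretches the effective detection time; that is the delay the in-kernel design described above is meant to avoid.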
Keywords/Search Tags: High Availability, Heartbeat Detection, Fault-tolerant