Design And Implementation Of Distributed Multi-machine Fault-tolerant System

Posted on: 2010-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: J J Liu
GTID: 2178360302960399
Subject: Computer system architecture
Abstract/Summary:
With the growing popularity of Internet applications, and in particular the large number of business services and large-scale information processing now provided over the Internet, higher requirements have been placed on both the processing power and the availability of computer systems. To avoid the downtime and service interruptions caused by computer failures, the stability and availability of business processing systems must be improved as much as possible. Fault tolerance is the technology most commonly used to achieve high availability, and cluster systems, one of its most representative applications, have attracted increasing attention because of their wide use in high-availability settings.
Against this background, this thesis first studies in depth the key technologies of high availability, fault tolerance and cluster computing, and then summarizes the problems commonly faced by current cluster systems. To address these problems, a distributed multi-node fault-tolerant system is designed and implemented on the laboratory's existing hardware and software resources. The system provides two fault-tolerance capabilities, giving dual protection to both tasks and computing nodes, and also offers flexible task scheduling and better load-balancing performance.
The system organizes its nodes in a distributed, loosely coupled architecture, which makes it extensible. To overcome the high communication cost of message passing in a distributed architecture, a group management model is constructed; it preserves system scalability while greatly reducing the communication cost of the periodic messages exchanged between nodes.
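The abstract does not describe how the group management model is organized. As a rough illustration of why grouping can cut periodic message traffic, the sketch below assumes a two-level scheme in which each member sends heartbeats only to its group leader and the leaders exchange heartbeats among themselves. The function names, the `Group` class and the timeout rule are all hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass, field


def full_mesh_messages(n_nodes: int) -> int:
    """Heartbeats per period when every node pings every other node."""
    return n_nodes * (n_nodes - 1)


def grouped_messages(n_nodes: int, group_size: int) -> int:
    """Heartbeats per period under the assumed two-level group scheme.

    Assumption: every non-leader sends one heartbeat to its group leader,
    and every leader sends one heartbeat to each other leader.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    n_groups = -(-n_nodes // group_size)      # ceiling division
    member_to_leader = n_nodes - n_groups     # one message per non-leader
    leader_mesh = n_groups * (n_groups - 1)   # leaders form a full mesh
    return member_to_leader + leader_mesh


@dataclass
class Group:
    """Hypothetical group leader that tracks member heartbeats."""
    leader: str
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # Record the time of the latest heartbeat received from a member.
        self.last_seen[node] = now

    def suspected(self, now: float) -> list:
        # Members silent for longer than the timeout are suspected failed.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)
```

Under these assumptions, 64 nodes in groups of 8 exchange 112 heartbeats per period instead of the 4032 needed by a full mesh. The trade-off is that a leader failure has to be detected and handled by the other leaders, which this sketch does not cover.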
Because task scheduling is an NP-hard problem, this thesis designs a distributed task scheduling model and a distributed consultation algorithm, which reduce algorithmic complexity and improve efficiency by delegating most of the decision-making to the individual nodes. The algorithm is also more comprehensive in that it considers a wide range of performance indicators, including the earliest task execution time, inter-node communication, load balancing and scheduling overhead, and it meets these performance requirements through dynamic multi-objective scheduling that adapts to the current state of the tasks and the system. On this basis, a failure takeover algorithm is designed that uses the task scheduling algorithm to redistribute failed tasks, or the tasks on a failed node, which both preserves overall system performance and meets the system's high-availability requirements.
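The abstract gives no formulas for the scheduling or takeover algorithms, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes that each node computes its own bid from local state (earliest start time, a communication cost, current load) as a weighted sum, that the lowest bid wins, and that failure takeover simply re-submits a failed node's tasks through the same routine. The `Node` class, the weights and the tie-breaking rule are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    comm_cost: float = 0.0    # assumed cost of shipping a task to this node
    busy_until: float = 0.0   # time at which this node's queue drains
    alive: bool = True
    tasks: list = field(default_factory=list)

    def bid(self, weights=(1.0, 1.0, 1.0)) -> float:
        """Local decision step: combine several indicators into one score.

        Assumption: a plain weighted sum of earliest start time,
        communication cost and queue length; lower is better.
        """
        w_start, w_comm, w_load = weights
        return (w_start * self.busy_until
                + w_comm * self.comm_cost
                + w_load * len(self.tasks))

    def accept(self, task: str, duration: float) -> None:
        self.tasks.append((task, duration))
        self.busy_until += duration


def schedule(task: str, duration: float, nodes: list) -> Node:
    """Collect bids from live nodes and assign the task to the lowest bid."""
    candidates = [n for n in nodes if n.alive]
    if not candidates:
        raise RuntimeError("no live node can accept the task")
    # Ties are broken by node name so the result is deterministic.
    winner = min(candidates, key=lambda n: (n.bid(), n.name))
    winner.accept(task, duration)
    return winner


def take_over(failed: Node, nodes: list) -> dict:
    """Redistribute a failed node's tasks using the ordinary scheduler."""
    failed.alive = False
    orphaned, failed.tasks = failed.tasks, []
    failed.busy_until = 0.0
    return {task: schedule(task, duration, nodes).name
            for task, duration in orphaned}
```

Because each bid is computed from the node's own state, the coordinator only compares scores, which mirrors the abstract's point about pushing decision-making out to the nodes. Reusing `schedule` inside `take_over` means recovered tasks are placed under the same load-balancing criteria as new ones.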
Keywords/Search Tags: High availability, Fault tolerance, Task scheduling, Fault recovery