Design And Implementation Of Distributed Multi-machine Fault-tolerant System

Posted on:2010-02-13

Degree:Master

Type:Thesis

Country:China

Candidate:J J Liu

Full Text:PDF

GTID:2178360302960399

Subject:Computer system architecture

Abstract/Summary:

As Internet applications grow in popularity, and in particular as more business services and large-scale information processing move onto the Internet, higher demands are placed on both the processing power and the availability of computer systems. To avoid downtime and service interruptions caused by computer failures, the stability and availability of business processing systems must be improved as far as possible. Fault tolerance is the most commonly used technique for achieving high availability, and cluster systems, one of its most representative applications, have attracted growing attention because of their wide use in high-availability settings.
Against this background, this thesis first studies in depth the key technologies of high availability, fault tolerance, and cluster computing, and then summarizes the common problems faced by current cluster systems. To address these problems, a distributed multi-node fault-tolerant system is designed and implemented on the existing hardware and software resources of the lab. The system provides two fault-tolerance capabilities, giving dual protection to both tasks and computing nodes, and it also offers flexible task scheduling and better load-balancing performance.
The system organizes its nodes in a distributed, loosely coupled architecture, which makes it extensible. To overcome the high communication cost of messaging in a distributed architecture, a group management model is constructed, which preserves system scalability while greatly reducing the communication cost of the periodic messages exchanged between nodes.
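The abstract does not describe the internals of the group management model, so the following is only a rough sketch of why grouping can cut periodic message traffic. It assumes a hypothetical scheme in which each ordinary node sends one heartbeat per period to its group leader and only the leaders exchange heartbeats with one another; the function names and the leader-based layout are illustrative assumptions, not the thesis's design.

```python
import math


def flat_heartbeats(n: int) -> int:
    """Messages per period if every node sends a heartbeat to every other node."""
    return n * (n - 1)


def grouped_heartbeats(n: int, group_size: int) -> int:
    """Messages per period under an assumed leader-based grouping.

    Assumption (not from the thesis): nodes are split into groups of at most
    `group_size`; each non-leader sends one heartbeat to its leader, and the
    leaders exchange heartbeats all-to-all.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    groups = math.ceil(n / group_size)
    member_to_leader = n - groups          # one message per non-leader node
    leader_mesh = groups * (groups - 1)    # all-to-all among leaders only
    return member_to_leader + leader_mesh


if __name__ == "__main__":
    for n in (16, 100, 400):
        print(n, flat_heartbeats(n), grouped_heartbeats(n, group_size=10))
```

Under these assumptions the flat scheme grows quadratically with the number of nodes, while the grouped scheme grows roughly linearly as long as the number of groups stays small.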
Because task scheduling is an NP-hard problem, this thesis designs a distributed task scheduling model and a distributed consulting algorithm, which reduce algorithmic complexity and improve efficiency by delegating most of the decision-making to the individual nodes. The algorithm is also more comprehensive: it considers a wide range of performance indicators, including the earliest task execution time, communication between nodes, load balancing, and scheduling overhead, and it meets these performance requirements through dynamic multi-objective scheduling based on the current states of the tasks and the system. On this basis, a failure takeover algorithm is designed that uses the task scheduling algorithm to redistribute failed tasks, or the tasks on a failed node, which both maintains overall system performance and meets the system's high-availability requirements.
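The scheduling and takeover ideas above might be sketched roughly as follows. This is not the thesis's algorithm: it assumes a simple bidding scheme in which each live node computes its own weighted cost for a task (estimated finish time, communication cost, current load) and the task goes to the lowest bidder, and it assumes takeover simply re-submits a failed node's queued tasks to the same scheduler. The `Node` fields, the weights, and the function names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    speed: float = 1.0       # work units processed per second (assumed metric)
    comm_cost: float = 0.0   # assumed cost of shipping a task to this node
    alive: bool = True
    queue: list = field(default_factory=list)  # (task_name, work) pairs

    def load(self) -> float:
        return sum(work for _, work in self.queue)

    def bid(self, work: float, w_time=1.0, w_comm=1.0, w_load=0.5) -> float:
        """Each node evaluates the task locally; lower is better."""
        finish_time = (self.load() + work) / self.speed
        return w_time * finish_time + w_comm * self.comm_cost + w_load * self.load()


def schedule(task: str, work: float, nodes: list) -> Node:
    """Assign the task to the live node with the lowest bid."""
    candidates = [n for n in nodes if n.alive]
    if not candidates:
        raise RuntimeError("no live node available")
    best = min(candidates, key=lambda n: n.bid(work))
    best.queue.append((task, work))
    return best


def take_over(failed: Node, nodes: list) -> dict:
    """Mark a node as failed and reschedule its queued tasks on the survivors."""
    failed.alive = False
    orphaned, failed.queue = failed.queue, []
    return {task: schedule(task, work, nodes).name for task, work in orphaned}


if __name__ == "__main__":
    cluster = [Node("a"), Node("b", speed=2.0), Node("c", comm_cost=1.0)]
    for t in ("t1", "t2", "t3"):
        print(t, "->", schedule(t, 4.0, cluster).name)
    print("takeover:", take_over(cluster[1], cluster))
```

Letting each node compute its own bid keeps most of the decision work local, which is one plausible reading of dispatching decision-making to the nodes; reusing `schedule` inside `take_over` mirrors the abstract's statement that takeover redistributes tasks through the scheduling algorithm.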

Keywords/Search Tags:

High Availability, Fault-tolerant, Task scheduling, fault-recovery

Related items

1	Research On Recovery-Oriented Fault-Tolerant Computing Technique
2	Distributed File System Level Fault-tolerant Mechanism
3	Research On Clustering-based Task Fault-tolerant Scheduling Of Ad Hoc Network
4	Modeling Of Fault Diagnosis And Recovery Function Of Fault-tolerant System
5	Design And Implementation Of Multi-machine Fault-tolerant System On Linux
6	Design And Implementation Of Cluster Fault-tolerant System
7	Research On Fault Recovery Techniques For Soft Errors Of COTS DSP
8	Research On The Task Fault-tolerant Scheduling Optimization Algorithms For The Distributed Real-Time System
9	Research On Virtual Machine Based Fast Fault Recovery Technology
10	Fault-Tolerant Task Scheduling Algorithms For Real-Time Systems Based On ICM Model