Research On Performance Optimization Techniques In Fault Tolerant Distributed Systems

Posted on:2008-01-29

Degree:Doctor

Type:Dissertation

Country:China

Candidate:L Li

GTID:1118360242999264

Subject:Computer Science and Technology

Abstract/Summary:

With the growing number of distributed computing systems deployed across a wide range of critical domains, the demand for high reliability and high availability in these systems is becoming increasingly urgent. Fault tolerance is the main means of ensuring the reliability and availability of applications: it enables a system to keep providing its service even if some of its components fail. However, implementing fault tolerance mechanisms in distributed systems built from commercial off-the-shelf components usually reduces application performance considerably. Performance optimization in distributed fault-tolerant systems is therefore widely studied.

In this dissertation, we study the key techniques of fault-tolerant computing systems with performance optimization as the main goal, providing support for the development of a high-performance fault-tolerant computing platform. We study two kinds of performance optimization techniques: algorithm optimization and architecture optimization. For algorithm optimization, we focus on the total order broadcast problem and the optimistic synchronous replication problem. For architecture optimization, we propose an extensible and adaptive fault-tolerant computing framework. The main contributions of this dissertation are as follows:

1. We propose two optimized total order broadcast algorithms: the ED algorithm and the TDM algorithm. The ED algorithm is designed for static systems that use unreliable failure detectors. It takes advantage of an optimistic assumption and a piggyback mechanism so that messages can be delivered earlier, which reduces the communication delay. The TDM algorithm is designed for dynamic systems that use group membership services. It combines the token-based algorithm with the deterministic merge algorithm, so it achieves both low latency and high throughput, and it is even more efficient under a bursty message arrival pattern.

2. We propose an efficient replication algorithm, AROA. The algorithm is based on the active replication mode, but it uses an optimistic approach to reduce the response time while still ensuring the consistency of replicas. Its main idea is that all replicas receive the client requests and perform the request processing task and the request ordering task concurrently. In most cases, requests are processed in the same order in which they are ordered, so the concurrent execution reduces the response time. The AROA algorithm never returns the reply to a request before the ordering task has confirmed the order of that request. Therefore, if the optimistic assumption does not hold, the algorithm can perform a recovery task to ensure consistency. In addition, we propose combining the optimistic algorithm with a conservative one to avoid the negative effects of the optimistic one.

3. We study an extensible fault-tolerant computing framework. Most existing fault-tolerant computing frameworks provide only a limited number of replication protocols. These protocols are general and do not exploit the semantic knowledge of applications, so they are not the best choice for user applications. We propose a framework that allows users to develop their own replication protocols and plug them in. The framework is designed around a reflection mechanism to simplify the development of replication protocols. In addition, group-oriented remote procedure call primitives are provided to make the communication mechanism of a replication protocol easy to implement.

4. We study an adaptive fault tolerance management mechanism. We add an adaptation mechanism to existing fault tolerance management frameworks, which allows the system to reconfigure dynamically in response to changes in the execution environment. The adaptive management mechanism optimizes resource utilization to improve the performance of fault-tolerant applications while still assuring reliability and availability.

5. We design and implement a fault-tolerant computing platform. Based on the key technologies described above and on the Starbus+ middleware developed by the National University of Defense Technology, we propose a distributed fault-tolerant computing platform named StarFT to support the development and management of fault-tolerant applications.
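
The optimistic replication idea in the second contribution (execute requests in arrival order while the ordering task runs, withhold each reply until its order is confirmed, and recover when the two orders disagree) can be illustrated with a minimal Python sketch. This is not the AROA algorithm itself: the names `OptimisticReplica`, `apply`, `on_receive`, and `on_order_confirmed` are hypothetical, the replicated state is a single integer, the ordering task is simulated by the caller, and every request is assumed to be received before its order is confirmed.

```python
from dataclasses import dataclass, field


def apply(state, request):
    """Deterministic toy state machine: returns (new_state, reply)."""
    _req_id, op, arg = request
    if op == "add":
        new_state = state + arg
    elif op == "mul":
        new_state = state * arg
    else:
        raise ValueError(f"unknown operation: {op}")
    return new_state, new_state


@dataclass
class OptimisticReplica:
    state: int = 0                                 # state after all confirmed requests
    tentative: list = field(default_factory=list)  # (request, reply, state_after) entries
    rollbacks: int = 0                             # how often the optimistic guess failed

    def on_receive(self, request):
        """Execute a request optimistically, in arrival order; the reply is withheld."""
        base = self.tentative[-1][2] if self.tentative else self.state
        new_state, reply = apply(base, request)
        self.tentative.append((request, reply, new_state))

    def on_order_confirmed(self, request):
        """Handle the next request in the definitive order and release its reply."""
        if self.tentative and self.tentative[0][0] == request:
            # Optimistic order matched: reuse the result computed earlier.
            _, reply, new_state = self.tentative.pop(0)
            self.state = new_state
            return reply
        # Mismatch: discard tentative results, execute the confirmed request
        # on the confirmed state, then redo the remaining requests optimistically.
        self.rollbacks += 1
        pending = [r for r, _, _ in self.tentative if r != request]
        self.tentative = []
        self.state, reply = apply(self.state, request)
        for r in pending:
            self.on_receive(r)
        return reply


def run(arrival_order, definitive_order):
    """Simulate one replica; returns (replies, final_state, rollbacks)."""
    replica = OptimisticReplica()
    for request in arrival_order:
        replica.on_receive(request)
    replies = [replica.on_order_confirmed(r) for r in definitive_order]
    return replies, replica.state, replica.rollbacks
```

In this sketch, a replica whose arrival order matches the definitive order answers from its tentative results without re-execution, while a replica that received the requests in a different order rolls back once and still ends in the same state with the same replies. Because no reply leaves the replica before confirmation, the rollback is invisible to the client.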

Keywords/Search Tags:

Distributed Computing, Fault Tolerance, Reliability, Availability, Performance Optimization, Fault-Tolerant Computing Platform

Related items

1	The Study And Analysis On Fault-Tolerant Parallel Algorithm
2	Optimization Techniques Of Proactive Fault Tolerance For Large-scale High Performance Computing Systems
3	Research And Implementation Of A Fault-Tolerance Evaluation Approach On The Fault-Tolerant Prototype
4	Research On Fault Tolerance Of High-performance Computing With NVRAM
5	Research And Design Of Fault Injection Platform For Cloud Computing System
6	Research On Adaption Method Of Cloud Fault Tolerance Services Based On User Requirement And Resource Constriction
7	Design And Implementation Of Distributed Stream Computing Framework Fault Tolerance
8	Algorithm Research On Multicast Routing With Fault-Tolerance And High Reliability