Research On Performance Optimization Techniques In Fault Tolerant Distributed Systems

Posted on: 2008-01-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Li
Full Text: PDF
GTID: 1118360242999264
Subject: Computer Science and Technology
Abstract/Summary:
With the growing use of distributed computing systems in a wide range of critical domains, the demand for high reliability and high availability in these systems is becoming increasingly urgent. Fault tolerance is the main means of ensuring the reliability and availability of applications: it enables a system to keep providing its service even if some of its components fail. However, implementing fault tolerance mechanisms in distributed systems built from commercial off-the-shelf components usually reduces application performance considerably. Performance optimization in fault-tolerant distributed systems has therefore been widely studied.

In this dissertation, we study the key techniques of fault-tolerant computing systems with performance optimization as the main goal, in order to support the development of a high-performance fault-tolerant computing platform. We study two kinds of performance optimization techniques: algorithm optimization and architecture optimization. For algorithm optimization, we focus on the total order broadcast problem and the optimistic synchronous replication problem. For architecture optimization, we propose an extensible and adaptive fault-tolerant computing framework. The main contributions of this dissertation are as follows:

1. We propose two optimized total order broadcast algorithms: the ED algorithm and the TDM algorithm. The ED algorithm is designed for static systems that use unreliable failure detectors. It takes advantage of an optimistic assumption and a piggybacking mechanism so that messages can be delivered earlier, which reduces the communication delay. The TDM algorithm is designed for dynamic systems that use group membership services. It combines the token-based algorithm with the deterministic merge algorithm, so it achieves both low latency and high throughput, and it is even more efficient under bursty message arrival patterns.

2. We propose an efficient replication algorithm, AROA. The algorithm is based on active replication, but it uses an optimistic approach to reduce the response time while still ensuring the consistency of the replicas. Its main idea is that all replicas receive the client requests and perform the request processing task and the request ordering task concurrently. In most cases, requests are processed in the same order in which they are eventually ordered, so the concurrent execution reduces the response time. The AROA algorithm never returns the reply to a request to the client before the ordering task has confirmed the order of that request. Therefore, if the optimistic assumption does not hold, the algorithm can perform a recovery task to ensure consistency. In addition, we propose combining the optimistic algorithm with a conservative one to avoid the negative effects of the optimistic one.

3. Research on an extensible fault-tolerant computing framework. Most existing fault-tolerant computing frameworks provide only a limited number of replication protocols. These protocols are general-purpose and do not exploit the semantic knowledge of applications, so they are not the best choice for user applications. We propose a framework that allows users to develop their own replication protocols and plug them in. The framework is based on the reflection mechanism, which simplifies the development of replication protocols. In addition, group-oriented remote procedure call primitives are provided to make the communication mechanism of a replication protocol easy to implement.

4. Research on an adaptive fault tolerance management mechanism. We add an adaptation mechanism to existing fault tolerance management frameworks, which allows the system to reconfigure dynamically in response to changes in the execution environment. The adaptive management mechanism optimizes resource utilization to improve the performance of fault-tolerant applications while still assuring reliability and availability.

5. Design and implementation of a fault-tolerant computing platform. Based on the studies of the key technologies described above and on the Starbus+ middleware developed by the National University of Defense Technology, we propose a distributed fault-tolerant computing platform named StarFT to support the development and management of fault-tolerant applications.
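The optimistic idea behind the AROA algorithm in contribution 2 can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions and not the dissertation's actual algorithm: the `Replica` class, its method names, the integer state, and the rule that the reply equals the new state are all hypothetical. It assumes that a request has already arrived at the replica by the time its order is confirmed. Each replica executes requests in arrival order, holds the replies back, and releases a reply only when the ordering task confirms the request. On a mismatch it rolls back to the last confirmed state and re-executes.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica whose state is one integer (hypothetical, for illustration)."""
    state: int = 0                 # current, possibly tentative, state
    committed_state: int = 0       # state after the last confirmed request
    tentative: list = field(default_factory=list)  # (request_id, op, reply)
    rollbacks: int = 0

    def process_optimistically(self, request_id, op):
        """Execute a request in arrival order and hold its reply back."""
        self.state = op(self.state)
        self.tentative.append((request_id, op, self.state))

    def confirm(self, request_id):
        """Called when the ordering task fixes the next request in the total
        order. Returns the reply that may now be sent to the client.
        Assumes the request was already processed optimistically."""
        if self.tentative and self.tentative[0][0] == request_id:
            # Optimistic assumption held: the tentative result is final.
            _, _, reply = self.tentative.pop(0)
            self.committed_state = reply
            return reply
        # Optimistic assumption failed: recover from the last confirmed state.
        self.rollbacks += 1
        ops = {rid: op for rid, op, _ in self.tentative}
        remaining = [(rid, op) for rid, op, _ in self.tentative
                     if rid != request_id]
        self.state = ops[request_id](self.committed_state)
        self.committed_state = self.state
        reply = self.state
        # Re-execute the other tentative requests on top of the new state.
        self.tentative = []
        for rid, op in remaining:
            self.process_optimistically(rid, op)
        return reply


OPS = {"a": lambda s: s + 1, "b": lambda s: s * 2}  # two non-commuting requests


def run(arrival_order, confirmed_order):
    """Feed requests to one replica, then confirm them in the total order."""
    replica = Replica()
    for rid in arrival_order:
        replica.process_optimistically(rid, OPS[rid])
    replies = [replica.confirm(rid) for rid in confirmed_order]
    return replies, replica.state, replica.rollbacks


if __name__ == "__main__":
    print(run(["a", "b"], ["a", "b"]))  # arrival order matches the total order
    print(run(["b", "a"], ["a", "b"]))  # arrival order differs: one rollback
```

In this toy run, both replicas return the same replies and reach the same final state. The replica whose arrival order differed from the confirmed order pays for one rollback, which is the cost that combining the optimistic algorithm with a conservative one is meant to limit.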
Keywords/Search Tags: Distributed Computing, Fault Tolerance, Reliability, Availability, Performance Optimization, Fault-Tolerant Computing Platform