Research On Failure Detection Of The Disaster Tolerance Storage Systems

Posted on:2009-07-20

Degree:Doctor

Type:Dissertation

Country:China

Candidate:G Yang

GTID:1118360272472365

Subject:Computer system architecture

Abstract/Summary:

With the development of distributed systems and network technology, the storage capacity, performance and scalability of distributed storage systems have increased rapidly. However, disaster tolerance storage systems face great challenges in fault tolerance. As storage systems scale up to thousands of storage nodes, multiple failures occur frequently and can lead to disastrous data loss. Designing an efficient and reliable fault tolerance mechanism has therefore become an urgent issue for disaster tolerance storage systems.

Failure detection is a key technology for implementing a disaster tolerance system, and this dissertation focuses on problems related to failure detection. First, we introduce the distributed system model. In synchronous distributed systems, reliable and accurate failure detection can be implemented. In asynchronous distributed systems, however, reliable failure detection cannot be achieved because of message transmission delay and message loss.

We identify the problems in designing and implementing a scalable and generic failure detection service for a large-scale disaster tolerance storage system. Several approaches proposed in the literature are studied, and their effectiveness and limitations in addressing the identified problems are discussed. More specifically, we identify five fundamental problems that an ideal failure detector must address efficiently: message saving, scalability, message loss, flexibility and dynamism. Each approach successfully addresses one or more of these problems, but none provides a complete and satisfactory solution. In particular, they lack flexibility in the context of distributed storage systems.
Thus an efficient failure detection service is required for large-scale disaster tolerance storage systems.

Aiming at an accurate and efficient intelligent failure detector, the dissertation studies message saving, scalability, message loss, flexibility and dynamism. It places particular emphasis on the construction of the failure detector, the design of the failure detection algorithm, adaptability to changing network conditions and adaptability to the requirements of different applications. The main contents of the dissertation are as follows:

1. The dissertation analyzes the features and new demands of a large-scale failure detection service in light of the problems that large-scale failure detection encounters. It studies existing methods of realizing such a service and existing solutions to the fundamental problems of failure detectors, and compares the advantages and shortcomings of various failure detection protocols.

2. It designs a failure detection system suited to the real environment of disaster tolerance storage systems. The system achieves the required level of completeness and accuracy of failure detection, effectively mitigates the impact of varying load, and detects failures quickly and flexibly. To improve the extensibility of the failure detection system itself, the control node builds a global view by means of notifications.

3. The implementation is an adaptive variant of the heartbeat failure detector. It dynamically estimates the heartbeat detection timeout and the transmission delay of the system, and adapts to changes in the system state so as to reduce false detections. Its performance is analyzed according to QoS metrics.

4. We propose a new concept of failure detector, called the weight failure detector. A weight failure detector is an abstract entity that defines an interaction model and its properties.
In fact, it outputs a weight value that monotonically increases with elapsed time if the corresponding process has crashed; if the process is alive, the value is eventually reset to its initial value. Applications query the failure detector module to get the weight value of the corresponding process. Each application has its own threshold, which reflects its requirements and which it uses to interpret the weight value.

5. In the implementation of the weight failure detector, the threshold Φ set by the application can only loosely express the quality-of-service requirement. In practice, however, most distributed applications have strict timing constraints, so a failure detector needs to satisfy exact, quantitative QoS requirements expressed in terms of QoS metrics. In addition, the implementation of the weight failure detector has to assume that message behavior follows a normal distribution. A large-scale distributed storage system may not only exhibit a high degree of asynchrony, long transmission delays and a high message loss rate, but may also contain a number of storage nodes that are configured dynamically. In such an environment, message behavior does not always follow any one specific distribution. We therefore believe that a failure detector, as a general-purpose component, should not rely on such assumptions. We propose a new failure detector, called QWFD, which removes some of the conditions that the weight failure detector depends on and is more widely applicable in failure detection services.
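
The interaction model of the weight failure detector (a weight that grows while heartbeats are missing, is reset when one arrives, and is interpreted by each application against its own threshold) can be illustrated with a minimal Python sketch. The abstract does not give the dissertation's actual weight formula, so everything here is an assumption made for illustration: the class name `WeightFailureDetector`, the helper `suspects`, the sliding window size, and the use of a normal-distribution tail probability over heartbeat inter-arrival times, a computation borrowed from phi-accrual-style detectors.

```python
import math
from collections import deque


class WeightFailureDetector:
    """Hypothetical sketch, not the dissertation's algorithm."""

    def __init__(self, window=100):
        # Sliding window of recent heartbeat inter-arrival times (seconds).
        self.intervals = deque(maxlen=window)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat received at time `now`; this resets the weight."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def weight(self, now):
        """Suspicion weight at time `now`; 0.0 until two intervals are known."""
        if len(self.intervals) < 2:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-3)  # floor avoids division by zero
        elapsed = now - self.last_arrival
        # Assumed model: inter-arrival times are normally distributed.
        # p_later is the probability that the next heartbeat is still to come.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) on underflow
        return -math.log10(p_later)


def suspects(detector, now, threshold):
    """Each application interprets the weight with its own threshold."""
    return detector.weight(now) > threshold


if __name__ == "__main__":
    fd = WeightFailureDetector()
    for t in range(10):  # heartbeats at t = 0, 1, ..., 9 seconds
        fd.heartbeat(float(t))
    for now in (9.5, 10.0, 12.0):
        print(now, fd.weight(now), suspects(fd, now, threshold=8.0))
```

An application that prefers fast detection would pass a low threshold to `suspects`, while one that prefers few false suspicions would pass a high one; both query the same detector instance. The normal-distribution assumption built into this sketch is the kind of dependency that item 5 says QWFD is meant to remove.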

Keywords/Search Tags:

disaster tolerance storage systems, failure detector, weight failure algorithm, QoS (Quality of Service)

Related items

1	Research And Design Of Failure Detector In Disaster Recovery Storage System Based On WAN
2	Failure Tolerance And Prediction For Storage Systems In Datacenters
3	Research On Qos-oriented Failure Detection Service In Distributed Systems
4	The Implementation On Failure Detector In Byzantine Fault Tolerance Replication System
5	Design And Implementation Of Peer Disaster-tolerance System In Cross-domain Virtual Private Cloud Interworking Scenario
6	Research And Implementation Of A Failure Detection System With Composite Structure
7	Research On Failure Detection And Data Migration Techniques In Intelligent Network Storage System
8	Research And Implement Of Monitoring Technology Of Military Information System On Disaster Tolerance
9	Designing And Implementation Of Failure Detector In Asynchronous Distributed Systems
10	The Research Of QoS-Aware Web Services Selection And Fault-Tolerance Of Runtime Service Composition