
Research On Failure Detection Of The Disaster Tolerance Storage Systems

Posted on: 2009-07-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Yang
Full Text: PDF
GTID: 1118360272472365
Subject: Computer system architecture
Abstract/Summary:
With the development of distributed systems and network technology, the storage capacity, performance and scalability of distributed storage systems have increased rapidly. However, disaster tolerance storage systems face great challenges in fault tolerance. As storage systems scale up to thousands of storage nodes, multiple failures occur frequently and can cause a disaster through data loss. How to design an efficient and reliable fault tolerance mechanism has become an issue that urgently needs to be resolved in disaster tolerance storage systems.
Failure detection is a key technology for implementing a disaster tolerance system. This dissertation focuses on problems related to failure detection. First, we introduce the distributed system model. In synchronous distributed systems, reliable and accurate failure detection can be implemented. In asynchronous distributed systems, however, reliable failure detection cannot be achieved because of message transmission delay and message loss.
We identify the problems in designing and implementing a scalable and generic failure detection service for a large-scale disaster tolerance storage system. Several approaches proposed in the literature are studied, and their effectiveness and limitations in addressing the identified problems are discussed. More specifically, we identify the following five fundamental problems that an ideal failure detector must address efficiently: message saving, scalability, message loss, flexibility and dynamism. Each approach successfully addresses one or more of these problems, but none provides a complete and satisfactory solution. In particular, the approaches lack flexibility in the context of distributed storage systems.
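The effect of message delay on detection accuracy can be illustrated with a small sketch. It assumes a plain heartbeat detector with a fixed timeout; the function names, the timeline and the timeout value are hypothetical and are not taken from the dissertation.

```python
def suspects(last_heartbeat, now, timeout):
    """Return True if the monitored node is suspected at time `now`."""
    return now - last_heartbeat > timeout


def suspicion_log(arrivals, timeout, check_times):
    """For each check time, report whether the node is suspected.

    `arrivals` holds the times at which heartbeats reached the monitor.
    """
    log = []
    for t in check_times:
        received = [a for a in arrivals if a <= t]
        last = received[-1] if received else 0.0
        log.append((t, suspects(last, t, timeout)))
    return log


# Hypothetical timeline in seconds: the node is alive and sends a heartbeat
# every 1.0 s, but the third heartbeat is delayed and only arrives at t = 3.8.
arrivals = [1.0, 2.0, 3.8, 4.0]
timeout = 1.5  # assumed fixed timeout

log = suspicion_log(arrivals, timeout, check_times=[2.5, 3.6, 3.9])
for t, suspected in log:
    print(f"t={t}: suspected={suspected}")
```

In this timeline the node never crashes, yet the check made while the delayed heartbeat is still in transit reports a suspicion. A larger fixed timeout would avoid that mistake at the cost of slower detection of real crashes, which is the trade-off that motivates adaptive detectors.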
Thus, an efficient failure detection service is required for large-scale disaster tolerance storage systems. Aiming at an accurate, efficient and intelligent failure detector, the dissertation studies message saving, scalability, message loss, flexibility and dynamism. It places particular emphasis on the construction of the failure detector, the design of the failure detection algorithm, adaptability to changing network conditions, and adaptability to the requirements of different applications. The main contents of the dissertation are as follows:
1. The dissertation analyzes the features and new demands of large-scale failure detection services in light of the problems that large-scale failure detection encounters. It studies existing methods of realizing a large-scale failure detection service and the ways in which they address the fundamental problems of failure detectors, and compares the advantages and shortcomings of various failure detection protocols.
2. It designs a failure detection system suited to the real environment of disaster tolerance storage systems. The system achieves the required level of completeness and accuracy of failure detection, effectively mitigates the influence of various loads, and detects failures quickly and flexibly. To improve the scalability of the failure detection system itself, the control node builds a global view by means of notifications.
3. It implements an adaptive variant of the heartbeat failure detector. The detector dynamically estimates the heartbeat detection timeout and the transmission delay of the system, and adapts to changes in the system state so as to reduce false detections. Its performance is analyzed according to QoS metrics.
4. We propose a new concept of failure detector, called the weight failure detector. A weight failure detector is an abstract entity that defines an interaction model and its properties. It outputs a weight value that increases monotonically with elapsed time if the corresponding process has crashed; conversely, the value is eventually reset to its initial value if the process is alive. Applications query the failure detector module to obtain the weight value of the corresponding process. Each application has its own threshold, which reflects its requirements and which it uses to interpret the weight value.
5. In the implementation of the weight failure detector, the threshold Φ set by the application can only roughly characterize the required quality of service. In practice, however, most distributed applications have strict time constraints, so a failure detector needs to satisfy exact, quantitative QoS requirements expressed in terms of QoS metrics. In addition, the implementation of the weight failure detector has to assume that message behavior follows a normal distribution. A large-scale distributed storage system may exhibit not only a high degree of asynchrony, long transmission delays and a high message loss rate, but also a large number of storage nodes that can be dynamically configured. In such an environment, message behavior does not always follow any one specific distribution. We therefore believe that a failure detector, as a general-purpose component, should not rely on such assumptions. We propose a new failure detector, called QWFD, which removes some of the conditions that the weight failure detector depends on and is therefore more widely applicable in failure detection services.
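The interaction model of the weight failure detector described in item 4 can be sketched as follows. This is a minimal illustration, not the dissertation's algorithm: the class name `WeightFailureDetector`, the sliding window, and the weight formula (the negative base-10 logarithm of the probability that a heartbeat is still outstanding) are assumptions chosen for the example. Only the normal-distribution assumption, the monotonically increasing weight, the reset on a live process, and the per-application thresholds come from the abstract.

```python
import math
from collections import deque


class WeightFailureDetector:
    """Hypothetical weight-style detector for one monitored process."""

    def __init__(self, window=100):
        # Sliding window of recent heartbeat inter-arrival times.
        self.intervals = deque(maxlen=window)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat; the weight drops back toward zero."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def weight(self, now):
        """Weight that grows with the time elapsed since the last heartbeat."""
        if len(self.intervals) < 2:
            return 0.0  # not enough samples to estimate anything
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        var = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(var), 1e-3)  # floor avoids division by zero
        elapsed = now - self.last_arrival
        # Assumed model: inter-arrival times are normally distributed.
        # p_later = P(next heartbeat arrives later than `elapsed`).
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # keep the logarithm finite
        return -math.log10(p_later)


# Hypothetical heartbeat arrival times in seconds, with some jitter.
fd = WeightFailureDetector()
for t in [0.0, 1.0, 2.1, 3.0, 4.05, 5.0, 5.95, 7.0]:
    fd.heartbeat(t)

# Two applications interpret the same weight with their own thresholds.
aggressive_threshold = 1.0    # reacts quickly, tolerates more mistakes
conservative_threshold = 8.0  # reacts slowly, makes fewer mistakes

w = fd.weight(8.3)
print("aggressive app suspects:", w > aggressive_threshold)
print("conservative app suspects:", w > conservative_threshold)
```

Keeping the threshold outside the detector is what lets several applications share one monitoring module while making different trade-offs between detection time and accuracy. The sketch also makes the limitation raised in item 5 visible: the weight is only meaningful while the normal-distribution assumption holds.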
Keywords/Search Tags: storage systems of disaster tolerance, failure detector, weight failure algorithm, QoS (Quality of Service)