Study Of Fault Tolerance Checkpoint Algorithm In Distributed System And Software Design

Posted on: 2011-01-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y N Wang
Full Text: PDF
GTID: 2178360305451573
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of computer network technologies and the continuing expansion of distributed system applications, the demand for highly reliable and highly available distributed applications is becoming increasingly urgent. A high-availability system must never interrupt its service and must always provide correct service. Fault tolerance is therefore both a central research topic and a major difficulty in distributed systems, and it remains a challenging problem. Because a distributed system is widely dispersed geographically, the system as a whole cannot obtain a unified clock, which in turn is a key factor directly affecting distributed system performance. This poses a great challenge for fault tolerance in distributed systems. This thesis is carried out against the background of a project funded by the Natural Science Foundation of Shandong Province and takes the application of fault-tolerant checkpointing algorithms in distributed systems as its starting point. With the aims of enriching checkpoint-setting strategies and improving resource utilization and system efficiency, it explores in depth the problems of fault tolerance and checkpointing algorithms in distributed systems. It studies fault tolerance in distributed systems, checkpoint establishment and recovery algorithms, and the theory, methods, and techniques for setting the fault-tolerant checkpoint interval. It also explores how to combine fault-tolerant checkpointing algorithms with practical distributed fault-tolerant software, in order to improve the availability of distributed systems and the practicality of the algorithms. The thesis addresses the following subjects:
(1) A hierarchical model of a distributed fault-tolerant system is constructed. The main features of the model, the concept of fault tolerance, and the related theorems are presented, and the performance characteristics of distributed systems are analyzed.
Furthermore, because of its adaptive clock characteristics, a distributed fault-tolerant system encounters checkpoint communication problems such as orphan messages and in-transit messages. The conditions and theorem for eliminating these globally inconsistent checkpoint states are described, and the indicators used to evaluate the strengths and weaknesses of a checkpointing algorithm are given, namely the algorithm's time overhead and space overhead.
(2) Based on the principles of checkpointing for distributed fault tolerance, the characteristics of synchronous checkpointing, asynchronous checkpointing, and message-logging-based checkpointing are analyzed in order to understand in depth the factors behind the performance bottlenecks of checkpointing algorithms. Complicated algorithm design, poor usability, and large space overhead all have a great impact on distributed system performance. On this basis, an improved fault-tolerant distributed checkpointing algorithm is proposed: the matrix consistent checkpointing algorithm. The algorithm starts from the basic elements of fault-tolerant inter-process communication and takes the number of inter-process communications as the core idea of its design. It reduces time and space overhead and improves the overall performance of the system. Deductive reasoning shows that the checkpointing algorithm is simple and effective.
(3) A study of the performance bottlenecks of checkpointing algorithms shows that the checkpoint interval setting and the selection of checkpoints have a significant impact on algorithm performance. In an algorithm for setting checkpoint intervals, minimizing the time and space overhead caused by the checkpoint mechanism is important for improving computational efficiency.
In the context of dynamic checkpoint intervals, a novel HSMM dynamic checkpoint interval algorithm, named Markov dynamic checkpoint intervals, is presented; it exploits a random time series analysis method. The method not only decreases the recomputation time after rollback but also reduces the storage space needed to save process states at checkpoints. Simulation results show that, compared with fixed checkpoint intervals and ordinary dynamic checkpoint intervals, the proposed method reduces the average checkpoint interval overhead by 1.019% relative to traditional strategies.
(4) The technology for implementing a distributed fault-tolerant software platform is very important. Because distributed systems are cross-platform, two versions of the fault-tolerant software are designed, one for the Windows platform and one for the Linux platform. The former mainly uses Detour Windows API technology: it achieves fault tolerance under Windows by inserting a thread with fault-tolerant features into processes in the system, and it can also save files as part of fault-tolerant recovery. The latter mainly uses Linux loadable kernel module (LKM) technology: an LKM with fault-tolerant features is inserted into the Linux kernel to carry out checkpointing and fault-tolerant recovery under Linux. The fault-tolerant software provides effective support for the availability of distributed systems and has practical application value.
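The orphan and in-transit messages mentioned under subject (1) can be illustrated with a small sketch. It is not the thesis's matrix consistent checkpointing algorithm. It only shows the standard consistency condition, under the assumption that each process records per-peer message counts at its checkpoint. The names `classify_messages`, `is_consistent`, `sent`, and `received` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Matrix = List[List[int]]
Entry = Tuple[int, int, int]  # (sender, receiver, message count)


def classify_messages(sent: Matrix, received: Matrix) -> Tuple[List[Entry], List[Entry]]:
    """Compare the message counts saved in a set of local checkpoints.

    Assumed layout (hypothetical, not from the thesis):
      sent[i][j]     = messages process i had sent to j when i checkpointed
      received[j][i] = messages process j had received from i when j checkpointed
    """
    n = len(sent)
    orphans: List[Entry] = []
    in_transit: List[Entry] = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = sent[i][j] - received[j][i]
            if diff < 0:
                # Receipt is recorded but the matching send is not: orphan messages.
                orphans.append((i, j, -diff))
            elif diff > 0:
                # Send is recorded but the receipt is not: in-transit messages.
                in_transit.append((i, j, diff))
    return orphans, in_transit


def is_consistent(sent: Matrix, received: Matrix) -> bool:
    """A set of checkpoints is globally consistent if it contains no orphan message."""
    orphans, _ = classify_messages(sent, received)
    return not orphans
```

For example, if process 0 recorded two sends to process 1 but process 1 recorded three receipts from process 0, the pair of checkpoints contains one orphan message and cannot be used as a recovery line. In-transit messages do not break consistency, but they would have to be logged or resent during recovery.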
Keywords/Search Tags:distributed system, fault-tolerant, checkpoint algorithm, Markov decision, fault-tolerant software