Research And Implementation Of Fault Recovery Mechanism In Large-scale Graph Processing System Based On BSP

Posted on: 2013-07-23
Degree: Master
Type: Thesis
Country: China
Candidate: C Ning
GTID: 2268330425497330
Subject: Computer software and theory
Abstract/Summary:
Parallel computing systems based on the BSP (Bulk Synchronous Parallel) model are applied to computations over massive data, such as graph and matrix computation, that require many iterations. Like MapReduce-style parallel computing systems, they rely on low-cost servers and local computing resources to increase parallel processing capability. Because such clusters consist of a large number of commodity computers, node failures occur frequently, so the system must tolerate these failures and take remedial measures in a timely manner. The purpose of this research is to design and implement a recovery mechanism that gives the system strong fault tolerance. The focus and main difficulty of the research are therefore to make fault recovery transparent, to improve the efficiency of fault detection, fault diagnosis, and fault recovery, and to reduce the impact on the system's normal operation.
In response to these issues, we implement a backward fault recovery mechanism based on incremental checkpoints, which both saves storage resources and makes failures relatively transparent. The work has three main aspects. First, because reading and writing checkpoints for large amounts of data reduces the efficiency of the system, we extend the traditional checkpoint mechanism with incremental checkpoint reads and writes. This greatly improves the speed of reading and writing checkpoints and saves storage resources; once the amount of data reaches a certain size, incremental checkpointing effectively improves the efficiency of the system. Second, the mechanism implements fault detection based on the "heartbeat", which solves the problem of how the master learns about faults: the master obtains fault information promptly through the heartbeat and triggers the corresponding recovery strategy.
Finally, because a failure can occur at any point while a job is running, we divide failures into three stages: before, during, and after the iterative computation. Each failure is handled according to the stage in which it occurs and its type.
After actual deployment and use, the fault recovery mechanism for the BSP-based large-scale graph processing system achieved the desired effect. Through the "heartbeat", the system detects the various types of failures that may occur while a job is running, and it applies a different fault recovery strategy depending on the stage of the job in which the failure occurs and the specific type of fault. The fault-tolerance mechanism is also easy to extend: new kinds of faults can be added conveniently, and with further revision it can be applied to fault recovery in related systems.
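The two mechanisms at the core of the abstract, incremental checkpointing and heartbeat-based fault detection, can be illustrated with a minimal Python sketch. It is not the thesis implementation: the class names `IncrementalCheckpointStore` and `HeartbeatMonitor`, the in-memory storage, the fixed vertex set, and the timeout rule are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List


class IncrementalCheckpointStore:
    """Hypothetical store: one full snapshot, then only changed vertex values.

    Assumes the vertex set is fixed, so deletions are not tracked.
    """

    def __init__(self) -> None:
        self._has_base = False
        self._base: Dict[int, float] = {}
        self._deltas: List[Dict[int, float]] = []
        self._last_saved: Dict[int, float] = {}

    def write(self, state: Dict[int, float]) -> int:
        """Checkpoint `state`; return the number of vertex values written."""
        if not self._has_base:
            # The first checkpoint is a full snapshot.
            self._base = dict(state)
            self._has_base = True
            written = len(state)
        else:
            # Later checkpoints keep only values that differ from the last one.
            delta = {v: val for v, val in state.items()
                     if self._last_saved.get(v) != val}
            self._deltas.append(delta)
            written = len(delta)
        self._last_saved = dict(state)
        return written

    def restore(self) -> Dict[int, float]:
        """Backward recovery: replay all deltas on top of the base snapshot."""
        state = dict(self._base)
        for delta in self._deltas:
            state.update(delta)
        return state


class HeartbeatMonitor:
    """Hypothetical master-side monitor: a worker whose last heartbeat is
    older than `timeout` seconds is reported as failed."""

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout
        self._clock = clock  # injectable so the logic can be tested
        self._last_seen: Dict[str, float] = {}

    def beat(self, worker: str) -> None:
        """Record a heartbeat received from `worker`."""
        self._last_seen[worker] = self._clock()

    def failed_workers(self) -> List[str]:
        """Return the workers whose heartbeat has timed out, sorted by name."""
        now = self._clock()
        return sorted(w for w, seen in self._last_seen.items()
                      if now - seen > self._timeout)
```

In this sketch a master would call `failed_workers()` periodically and, for each reported worker, rebuild the lost state with `restore()` before resuming the iteration. The storage saving comes from `write()` persisting only the vertices whose values changed since the previous checkpoint.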
Keywords/Search Tags: BSP, graph processing, checkpoint, fault detection, fault recovery