Research And Implementation Of Fault Recovery Mechanism In Large-scale Graph Processing System Based On BSP

Posted on:2013-07-23

Degree:Master

Type:Thesis

Country:China

Candidate:C Ning

Full Text:PDF

GTID:2268330425497330

Subject:Computer software and theory

Abstract/Summary:

Parallel computing systems based on the BSP model are applied to workloads that require many iterations, such as graph computation over massive data and matrix computation. Like MapReduce-style parallel computing systems, they rely on low-cost servers and local computing resources to increase parallel processing capacity. Because such clusters are built from large numbers of commodity machines, node failures occur frequently, so the system must tolerate these failures and take remedial measures promptly. The purpose of this research is to design and implement a recovery mechanism that gives the system strong fault tolerance. The focus of the research, and its main difficulty, is therefore to make fault recovery transparent, to improve the efficiency of fault detection, fault diagnosis, and fault recovery, and to reduce the impact on the system's normal operation.
In response to these issues, we implement a backward fault recovery mechanism based on incremental checkpoints, which both saves storage resources and makes failures relatively transparent. The work has three main aspects. First, because reading and writing checkpoints for large amounts of data reduces system efficiency, we extend the traditional checkpoint mechanism with incremental checkpoint reads and writes. This greatly speeds up reading and writing checkpoints and saves storage resources; once the data reaches a certain size, incremental checkpointing effectively improves the efficiency of the system. Second, the mechanism detects faults through a "heartbeat", which solves the problem of how the master obtains fault information: the master receives fault information promptly via the heartbeat and triggers the corresponding recovery strategy.
Finally, because failures can strike a running job at any time, we divide failures into three stages according to whether they occur before, during, or after the iterative computation, and handle each failure according to its stage and type.
After actual deployment and application, the fault recovery mechanism in the BSP-based large-scale graph processing system achieved the desired effect. The system detects the various types of failures that may occur while a job is running through the "heartbeat" and applies a different fault recovery strategy depending on the stage of the job in which the failure occurs and the specific type of fault. The fault-tolerance mechanism is also extensible: it is easy to add new kinds of faults that it can handle, or to adapt it, with further revision, to fault recovery in related systems.
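The abstract does not give implementation details, so the following is only a minimal Python sketch of the two ideas it names: incremental checkpointing and heartbeat-based fault detection. Every name in it (`IncrementalCheckpointer`, `HeartbeatMonitor`, the `timeout` parameter, the in-memory storage) is a hypothetical choice made for illustration, not the thesis's actual design. It also assumes a fixed vertex set and that timestamps are passed in explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalCheckpointer:
    """Hypothetical sketch: one full snapshot, then only changed values."""
    base: dict = field(default_factory=dict)    # full snapshot from the first checkpoint
    deltas: list = field(default_factory=list)  # (superstep, changes) for later checkpoints
    _last: dict = field(default_factory=dict)   # state as of the most recent checkpoint
    _has_base: bool = False

    def write(self, superstep, state):
        """Record a checkpoint; return the number of entries written."""
        if not self._has_base:
            self.base = dict(state)
            self._has_base = True
            written = len(state)
        else:
            # Assumption: vertices are never removed, so only changed values matter.
            changes = {v: x for v, x in state.items() if self._last.get(v) != x}
            self.deltas.append((superstep, changes))
            written = len(changes)
        self._last = dict(state)
        return written

    def restore(self, superstep):
        """Rebuild the state of the latest checkpoint at or before `superstep`."""
        state = dict(self.base)
        for step, changes in self.deltas:
            if step > superstep:
                break
            state.update(changes)
        return state


class HeartbeatMonitor:
    """Hypothetical master-side detector: a worker is suspected failed
    when no heartbeat has arrived within `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def beat(self, worker, now):
        self.last_seen[worker] = now

    def failed_workers(self, now):
        return sorted(w for w, t in self.last_seen.items() if now - t > self.timeout)


# Usage: checkpoint two supersteps, then roll back when a worker goes silent.
ckpt = IncrementalCheckpointer()
ckpt.write(0, {"a": 1.0, "b": 1.0, "c": 1.0})  # full snapshot
ckpt.write(1, {"a": 1.0, "b": 0.5, "c": 1.0})  # delta holds only "b"

monitor = HeartbeatMonitor(timeout=3.0)
monitor.beat("worker-1", now=0.0)
monitor.beat("worker-2", now=0.0)
monitor.beat("worker-1", now=4.0)              # worker-2 stays silent

if monitor.failed_workers(now=4.0):
    recovered = ckpt.restore(superstep=1)      # backward recovery to the last checkpoint
```

Storing only the changed values keeps each checkpoint proportional to the amount of state that moved in a superstep, at the cost of replaying the deltas on restore. A real system would also need to persist checkpoints to reliable storage and bound the length of the delta chain, which this sketch leaves out.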

Keywords/Search Tags:

bsp, graph processing, checkpoint, fault detection, fault recovery

Related items

1	The Research And Implementation Of Checkpoint Technology Based On WinNT
2	Research Of Task Recovery Strategy Based On Checkpoint In MapReduce
3	Research And Implementation Of Intelligent Fault Recovery System On Terabit Router
4	Research On Fault Tolerance In Distributed Stream Data Processing
5	Research On The Key Technology Of Fault Tolerance Based On Fault Data Preprocessing For Supercomputing Systems
6	Research About Fault-tolerance For Large-scale Graph Processing
7	Study On Backward Recovery Of Fault Tolerant Technology In Distributed Systems
8	Fault-Tolerant Of MPI Programs Based On Rollback Recovery
9	Design And Implementation Of Application Anomaly Recovery Mechanism In Android System
10	Research And Implementation Of The Automatic Jobs Fault Tolerant Technology Based On Checkpoint