Research About Fault-tolerance For Large-scale Graph Processing

Posted on:2016-10-21

Degree:Master

Type:Thesis

Country:China

Candidate:P Wang

Full Text:PDF

GTID:2308330476453490

Subject:Software engineering

Abstract/Summary:

PDF Full Text Request

Many real world computation problems concern large graphs, such as analytics on social networks. To e?ciently process large-scale graph datasets, Google proposed graph computation whose key idea is “think like a vertex”. Graph computation has become increasingly popular due to its effectiveness and e?ciency in a wide range of areas including web search, natural language processing and recommendation systems.With the algorithm complexity and dataset sizes continuously increase, it is now a common practice to run many MLDM algorithms on a cluster of machines. For example,Google has used hundreds to thousands of machines to run some MLDM algorithms.In the distributed environment, there are many kinds of faults, such as machine crash and power failure, which makes fault tolerance a must for distributed graph computation. The main contributions of this paper are as follows:1. A comprehensive analysis of current checkpoint-based fault tolerance mechanisms for graph computationBecause of complex computation on vertices and dependency among vertices,most graph-parallel systems use checkpoint to provide fault tolerance. These approaches have two problems: for normal execution, they will incur notable performance overhead; and for recovery, they have lengthy recovery time. For the system adopting this approach, during computation, the runtime system will periodically save the runtime states into a checkpoint on some reliable global storage, e.g., a distributed ?le system. When some machines crash, the runtime system will reload the previous computational states from the last checkpoint and then restart the computation. However,as the processes of checkpoint and recovery require a lot of costly network requests,such approaches incur notable performance overhead during normal execution as well as lengthy recovery time from a failure. Consequently, though most existing systems have been designed with fault tolerance support, they are disabled during production run by default.2. A new replication-based fault tolerance approach for graph computationMany distributed graph systems require creating replicas of vertices to provide local access semantics such that graph computation can be programmed as accessing local memory. Such replicas possess states of original vertices, thus they can be easily extended to provide fault tolerance. Based on this observation, this paper proposes Imitator, a new fault tolerance approach based on replication. Imitator can provide fault tolerance with negligible performance overhead, and its recovery is fast. These effects are achieved by two design principles in Imitator:? Imitator tries to reuse existing mechanisms to reduce the performance cost, Imi-tator backups vertex states by replicating them to existing replicas and maintainsthe freshness of replicas by extending existing message mechanism;? Imitator tries to harness the cluster resource to parallel recovery, by balancingthe distribution of replicas, Imitator distributes the recovery tasks to nodes inthe cluster during recovery, thus it can make a full use of the cluster resource torecover from failures.Evaluation shows that Imitator incurs negligible performance overhead(less than5% for all cases) and can recover from failures fast, outperforming the checkpoint-based recovery by up to 17.67X(from 3.55X)...

Keywords/Search Tags:

graph-parallel system, fault tolerance, recovery, replication

PDF Full Text Request

Related items

1	Research On Adaption Method Of Cloud Fault Tolerance Services Based On User Requirement And Resource Constriction
2	Study On Fault-Tolerance Mechanism And Realization In Real-Time Distributed Computer Systems
3	Research On Data Fault-Tolerance Of Parallel File System
4	XDFT: An Extensible Dynamic Fault Tolerance Mechanism For Cooperative Plotting System
5	Research On Fault-Tolerance Technology For Message-Passing System
6	Study And Realization Of Fault-tolerant Technology For Parallel Computer On Satellite
7	Performance Analysis And Optimization For Virtual Machine Level Fault Tolerance
8	Data Replication And Fault Tolerance In The Grid Environment
9	The Design And Implementation Of Fault Tolerant Web Service Based On Improved Passive Replication
10	Research On Rollback Recovery Fault-Tolerance Technology In High Availability Cluster