Font Size: a A A

Research About Fault-tolerance For Large-scale Graph Processing

Posted on:2016-10-21Degree:MasterType:Thesis
Country:ChinaCandidate:P WangFull Text:PDF
GTID:2308330476453490Subject:Software engineering
Abstract/Summary:PDF Full Text Request
Many real world computation problems concern large graphs, such as analytics on social networks. To e?ciently process large-scale graph datasets, Google proposed graph computation whose key idea is “think like a vertex”. Graph computation has become increasingly popular due to its effectiveness and e?ciency in a wide range of areas including web search, natural language processing and recommendation systems.With the algorithm complexity and dataset sizes continuously increase, it is now a common practice to run many MLDM algorithms on a cluster of machines. For example,Google has used hundreds to thousands of machines to run some MLDM algorithms.In the distributed environment, there are many kinds of faults, such as machine crash and power failure, which makes fault tolerance a must for distributed graph computation. The main contributions of this paper are as follows:1. A comprehensive analysis of current checkpoint-based fault tolerance mechanisms for graph computationBecause of complex computation on vertices and dependency among vertices,most graph-parallel systems use checkpoint to provide fault tolerance. These approaches have two problems: for normal execution, they will incur notable performance overhead; and for recovery, they have lengthy recovery time. For the system adopting this approach, during computation, the runtime system will periodically save the runtime states into a checkpoint on some reliable global storage, e.g., a distributed ?le system. When some machines crash, the runtime system will reload the previous computational states from the last checkpoint and then restart the computation. However,as the processes of checkpoint and recovery require a lot of costly network requests,such approaches incur notable performance overhead during normal execution as well as lengthy recovery time from a failure. Consequently, though most existing systems have been designed with fault tolerance support, they are disabled during production run by default.2. A new replication-based fault tolerance approach for graph computationMany distributed graph systems require creating replicas of vertices to provide local access semantics such that graph computation can be programmed as accessing local memory. Such replicas possess states of original vertices, thus they can be easily extended to provide fault tolerance. Based on this observation, this paper proposes Imitator, a new fault tolerance approach based on replication. Imitator can provide fault tolerance with negligible performance overhead, and its recovery is fast. These effects are achieved by two design principles in Imitator:? Imitator tries to reuse existing mechanisms to reduce the performance cost, Imi-tator backups vertex states by replicating them to existing replicas and maintainsthe freshness of replicas by extending existing message mechanism;? Imitator tries to harness the cluster resource to parallel recovery, by balancingthe distribution of replicas, Imitator distributes the recovery tasks to nodes inthe cluster during recovery, thus it can make a full use of the cluster resource torecover from failures.Evaluation shows that Imitator incurs negligible performance overhead(less than5% for all cases) and can recover from failures fast, outperforming the checkpoint-based recovery by up to 17.67X(from 3.55X)...
Keywords/Search Tags:graph-parallel system, fault tolerance, recovery, replication
PDF Full Text Request
Related items