
Performance Optimization Of Distributed Graph Computation Framework Based On BSP Model

Posted on: 2015-09-07 | Degree: Master | Type: Thesis
Country: China | Candidate: X Ding | Full Text: PDF
GTID: 2308330464463424 | Subject: Computer software and theory
Abstract/Summary:
Driven by large-scale data-mining and machine-learning applications such as social-network data mining, web page ranking, recommendation systems, and natural language processing, distributed computation models that describe data and their relations with a graph data structure have been widely adopted. Graph algorithms are typically iterative, and the complex relations between data items depend on vertex-to-vertex communication, so traditional data-parallel frameworks, represented by MapReduce [1] and its open-source implementation Hadoop, cannot support such algorithms efficiently.

To fill the gap between the urgent need for graph computation and the lack of platforms that could meet the computing needs of large graphs, Google developed Pregel, based on the bulk synchronous parallel (BSP) computation model. Pregel provides a simple "think like a vertex" programming model: users only need to supply a compute function, and the system takes charge of vertex scheduling and message communication. Pregel organizes graph computation as a series of supersteps, each composed of vertex computation, message communication, and a global synchronization barrier. The user-supplied compute function can access only the vertex's own data and the messages sent by adjacent vertices in the previous superstep, and the messages a vertex sends arrive at their destinations in the following superstep.

The synchronous programming model proposed by Pregel is simple and deterministic, and thus easy to debug. However, its pure message-passing mechanism supports only one-way data transfer, so it cannot efficiently support algorithms that must actively pull information from adjacent vertices (pull-mode algorithms). For such algorithms, Pregel has to keep all vertices alive (computing and sending messages) throughout the computation, producing a large amount of redundant computation and messaging for already-converged vertices, which wastes resources and seriously degrades system performance. This thesis proposes a new communication mechanism that simulates distributed shared memory through backup vertices, overcoming the drawbacks of the original mechanism for pull-mode algorithms: the huge message volume and the duplicated computation and communication. For the increasingly popular multi-core clusters, we also add support for the multi-core architecture, further optimizing the performance of distributed graph computation. Tests of our implementation, built on Hama (Apache's open-source clone of Pregel), over a series of applications and real-world data sets show that the new communication mechanism improves overall system performance by 2.06x to 8.69x.
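For concreteness, the following is a minimal sketch of the think-like-a-vertex model described above, using PageRank. The `Vertex` base class and its method names are simplified illustrations for this abstract, not the exact API of Pregel or Hama.

```java
// Framework-provided services, stubbed here so the sketch compiles.
// All names are illustrative, not a real framework's API.
class Vertex {
    double value;       // the vertex's current PageRank
    int numOutEdges;    // out-degree, used to split rank among neighbors
    long superstep;     // current superstep index, set by the runtime
    long numVertices;   // total vertex count, set by the runtime

    long getSuperstep()   { return superstep; }
    long getNumVertices() { return numVertices; }
    void sendMessageToNeighbors(double m) { /* delivered next superstep */ }
    void voteToHalt()     { /* sleep until a new message arrives */ }
}

class PageRankVertex extends Vertex {
    static final int MAX_SUPERSTEPS = 30;

    // Called once per superstep with the messages sent to this vertex
    // in the previous superstep.
    void compute(Iterable<Double> messages) {
        if (getSuperstep() == 0) {
            value = 1.0 / getNumVertices();
        } else {
            double sum = 0.0;
            for (double m : messages) sum += m;
            value = 0.15 / getNumVertices() + 0.85 * sum;
        }
        if (getSuperstep() < MAX_SUPERSTEPS) {
            // Data moves only by pushing messages forward; a vertex cannot
            // pull a neighbor's value on demand, which is why pull-mode
            // algorithms fit this model poorly.
            if (numOutEdges > 0) {
                sendMessageToNeighbors(value / numOutEdges);
            }
        } else {
            voteToHalt();
        }
    }
}
```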
While the synchronous computation model provides simple scheduling and determinism, it converges slowly because vertices can only use neighbor data from the previous superstep. To solve this problem, researchers at Carnegie Mellon University developed GraphLab, the first asynchronous model for graph computation. In the asynchronous model, a vertex propagates its data as soon as it updates itself, so adjacent vertices can compute with its latest value and the whole computation converges much faster. The GraphLab authors later observed that real-world data sets often follow a power-law distribution (a few vertices hold most of the edges) and proposed PowerGraph on the basis of GraphLab. PowerGraph uses a vertex-cut partitioning mechanism that spreads a single vertex's computation across a set of nodes to run in parallel, resolving the computation and communication imbalance on power-law graphs.

Although the asynchronous model converges quickly, it must maintain a global scheduling queue of active vertices and guarantee the consistency of vertex data in an asynchronous environment, both of which are expensive in a distributed setting. Since the synchronous and asynchronous computation models each have advantages and disadvantages, this thesis proposes a hybrid computation model that combines their strengths while avoiding their weaknesses. The hybrid model uses synchronous scheduling to avoid heavy scheduling overhead, while using an asynchronous data-transfer mechanism so that vertices compute with the latest data from adjacent vertices, thereby accelerating the convergence of the overall computation. Evaluations on two typical graph algorithms and three real data sets show that the hybrid model, implemented on top of PowerGraph, reduces computation by 30% and achieves a 1.2x to 2.4x speedup over the synchronous model, while outperforming the asynchronous model by up to 4.6x thanks to its lightweight scheduling.
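The sketch below illustrates the hybrid idea described above on a single machine: supersteps give synchronous, lightweight scheduling, while vertices read the latest neighbor values from a shared table instead of messages buffered from the previous superstep. All names are illustrative assumptions; this is not the thesis's PowerGraph-based implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Single-machine sketch of a hybrid model: synchronous superstep scheduling
// combined with reads of the freshest available vertex values.
class HybridPageRank {
    interface Graph {
        List<Integer> vertices();
        List<Integer> inNeighbors(int v);
        int outDegree(int v);
    }

    static Map<Integer, Double> run(Graph g, int supersteps) {
        Map<Integer, Double> rank = new ConcurrentHashMap<>();
        double n = g.vertices().size();
        for (int v : g.vertices()) rank.put(v, 1.0 / n);

        for (int step = 0; step < supersteps; step++) {
            // Synchronous scheduling: every vertex runs exactly once per
            // superstep; the stream's terminal operation acts as the barrier.
            g.vertices().parallelStream().forEach(v -> {
                double sum = 0.0;
                for (int u : g.inNeighbors(v)) {
                    // Asynchronous data: a neighbor updated earlier in this
                    // same superstep is read at its new value, which is what
                    // accelerates convergence relative to pure BSP.
                    sum += rank.get(u) / g.outDegree(u);
                }
                rank.put(v, 0.15 / n + 0.85 * sum);
            });
        }
        return rank;
    }
}
```

Note the trade-off this sketch makes explicit: a read may observe either the old or the new value of a neighbor, so determinism is exchanged for freshness, the data-conflict issue listed in the keywords below.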
Keywords/Search Tags: Distributed System, Distributed Graph Computation, Message Passing, Shared Memory, Synchronous Computation Model, Asynchronous Computation Model, Data Conflict