
Research And Implementation On Incremental And Iterative Processing Techniques For Large-Scale Graphs

Posted on: 2014-08-20
Degree: Master
Type: Thesis
Country: China
Candidate: Z G Wang
Full Text: PDF
GTID: 2268330425991824
Subject: Computer software and theory

Abstract/Summary:
With the rapid development of social networks, bioinformatics networks and information technologies, graph theory and its algorithms have attracted wide attention. In both academia and industry, it has become a hot topic to develop platforms for incrementally and iteratively processing large-scale graphs in cloud computing environments. Nevertheless, few existing methods are designed to optimize data partitioning, disk buffer management and network communication, which are the three key issues for improving performance. The incremental and iterative processing of large-scale graphs has three features: strong coupling of graph data, high iteration frequency and poor data-access locality. Optimizing these three key issues is therefore a nontrivial task, and it is the main work of this thesis.

Data partitioning. We analyze the locality of raw graphs generated by the depth-first search (DFS) and breadth-first search (BFS) algorithms, and then propose the continuous partition (CP) strategy. Compared with the random hash partition (RHP) method, the CP strategy preserves the locality of the raw graph, avoids shuffling all graph data over the network and balances the computation load. Meanwhile, to provide an addressing service for network communication during iterations, we design two encoding methods: a Hadoop-based encoding mechanism and a DHT-based Hybrid-MT mechanism, both of which convert string vertex identifiers into consecutively numbered identifiers.

Disk buffer management. By analyzing the incremental and iterative process, we propose a state-transition model. We then design a static hash index based on a column-oriented storage model and a tunable hash index based on a Markov model to reduce disk I/O costs. The static hash index improves the locality of matching message data with local graph data and avoids random data access. Furthermore, based on the state-transition model and a Markov cost-benefit analysis model, the tunable hash index adjusts its bucketing granularity dynamically to minimize I/O costs.

Network communication. We propose the EBSP model by extending the traditional Bulk Synchronous Parallel (BSP) model, and design a hybrid iteration mechanism in which graph data must be processed synchronously while message data can be handled across two consecutive iterations. In particular, the across-step message pruning (ASMP) method and the across-step message combination (ASMC) method are two optimization strategies for message processing, which reduce the message scale and accelerate the spread of messages, respectively. Furthermore, this thesis designs a VCCP (vertex-cut based on the CP strategy) method by integrating the vertex-cut technique with the CP strategy. By preserving the raw locality of BFS-generated graphs, the VCCP method greatly improves the overall performance with low preprocessing overhead.

Finally, we design and implement a prototype system, DiterGraph, and validate the effect of our mechanisms, including data partitioning, the hash indexes, ASMP and VCCP, through extensive experiments. For the single-source shortest path computation, DiterGraph outperforms Giraph by roughly a factor of 2, and is up to 21 and 43 times faster than Hadoop and Hama, respectively. For the PageRank computation, the overall performance of DiterGraph is 20 times faster than that of Hadoop.
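To make the contrast between the CP strategy and random hash partitioning concrete, the following Java sketch shows one way the two placement rules could be expressed once vertices carry consecutive numeric identifiers; the class and method names are illustrative assumptions for this abstract, not DiterGraph's actual API.

    // Minimal sketch, assuming vertices have already been renumbered 0..numVertices-1
    // (e.g. by a BFS-order encoding). Names are hypothetical, not from DiterGraph.
    public class ContinuousPartitioner {
        private final int numWorkers;
        private final long numVertices;

        public ContinuousPartitioner(int numWorkers, long numVertices) {
            this.numWorkers = numWorkers;
            this.numVertices = numVertices;
        }

        // Continuous partition: consecutively numbered vertices fall into the same
        // id range and land on the same worker, preserving the raw graph's locality.
        public int continuousPartition(long vertexId) {
            long chunkSize = (numVertices + numWorkers - 1) / numWorkers; // ceiling division
            return (int) (vertexId / chunkSize);
        }

        // Random hash partition for comparison: neighbouring vertex ids are
        // scattered across workers, so most edges cross partition boundaries.
        public int hashPartition(long vertexId) {
            return Math.floorMod(Long.hashCode(vertexId), numWorkers);
        }
    }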
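The hash-index idea of matching message data with local graph data bucket by bucket can be illustrated with a small sketch as well; the fixed bucket count below stands in for the bucketing granularity that the tunable index would adjust, and all names are assumptions rather than the thesis's implementation.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: hash vertex ids into buckets so that incoming messages
    // and the on-disk graph partition can be joined one bucket at a time with
    // sequential scans instead of random disk accesses.
    public class BucketIndex {
        private final int numBuckets;
        private final List<List<Long>> buckets;

        public BucketIndex(int numBuckets) {
            this.numBuckets = numBuckets;
            this.buckets = new ArrayList<>(numBuckets);
            for (int i = 0; i < numBuckets; i++) {
                buckets.add(new ArrayList<>());
            }
        }

        public int bucketOf(long vertexId) {
            return Math.floorMod(Long.hashCode(vertexId), numBuckets);
        }

        // Group incoming message targets by bucket; each bucket is later matched
        // against the corresponding bucket of graph data in one sequential pass.
        public void addMessageTarget(long vertexId) {
            buckets.get(bucketOf(vertexId)).add(vertexId);
        }

        public List<Long> bucket(int index) {
            return buckets.get(index);
        }
    }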
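Finally, a minimal sketch of across-step message combination for a shortest-path style workload: messages addressed to the same vertex are merged with a min() combiner whether they were produced in the current superstep or carried over from the previous one, which is how combining can shrink the message scale. The combiner class here is a hypothetical illustration, not ASMC as implemented in DiterGraph.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative across-step combiner for tentative shortest-path distances.
    public class AcrossStepCombiner {
        // target vertex id -> smallest tentative distance seen so far
        private final Map<Long, Double> pending = new HashMap<>();

        // Accepts messages from the current superstep as well as messages left
        // over from the previous one, combining them before they are sent.
        public void combine(long targetVertex, double distance) {
            pending.merge(targetVertex, distance, Math::min);
        }

        // Drained at the next synchronization barrier; only the combined
        // messages travel over the network.
        public Map<Long, Double> flush() {
            Map<Long, Double> combined = new HashMap<>(pending);
            pending.clear();
            return combined;
        }
    }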
Keywords/Search Tags: large-scale graph, incremental and iterative process, data partitioning, disk and index, network communication