Font Size: a A A

Research And Implementation On BSP-Based High Performance Iterative Computation For Large Graphs

Posted on:2015-09-02Degree:MasterType:Thesis
Country:ChinaCandidate:J P LiuFull Text:PDF
GTID:2348330473953714Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the rapid development and wide application of information technology, the effective management and efficient processing of tremendous information data have become a challenge to the academy and the industry. Data generated in the applications like social network, sensor network and e-commerce, et.al, has a similar structure with graphs, and has been a hot research topic because of its potential value. The MapReduce framework proposed by Google can support solving various problems on big data, but is not efficient for algorithms of applications on graphs, such as single source shortest path, subgraph mining and so on, because of the lack of the parallelism of graph partitions. Then the BSP model based processing systemsm, with Google's Pregel as a representative, have been an important way to compute large graphs, but the communication and disk based iterative processing need more focus and further research.Based on BSP model, this thesis solves the problem how to improve the performance of the processing system for computing large graphs, in the optimization of communication and disk based processing. First of all, we construct a kind of simple but effective communication architecture on message passing model, and adjust the parameters to optimize the basic communication performance. Second, based on that architecture, with the basic horizonal partitioning, we propose an optimized communication strategy named boundary vertex replication based on clustered edges (ECBVR), and build a cost evaluation model to analyze the effectiveness of communication optimization to the system performance. Then we propose a new computing mechanism named Vertex-Edge model, and design both a simple hash index and a multiple-queue sequential index to accelerate the processing speed. On the other hand, for the optimization of disk processing, we build a memory allocation model to benefit the ratio of the memory occupied by data, and propose two kinds of disk based processing mechanisms to reduce the random access of disk, namely the data grouping iteration(DGI) and the message sequence iteration(MSI). For the first method, we discuss two kinds of grouping algorithms, the random hash grouping and the balancing grouping. For the second one, we design a new data storage model OERSV to guarantee the sequence of message receiving, and a paging mechanism to avoid the random disk seek of the vertex table.We implement the techniques proposed in this thesis in the framework of BC-BSP, a graph processing system, and conduct experiments on synthetic datasets and real datasets. For ECBVR strategy, the impact analysis of the edge cluster partitioning threshold to the running time, loading time and communication scale, proves the effectiveness of ECBVR, and the comparison of the three implements reflects the efficiency of our index mechanism. For the optimization of the disk based processing, the experimental results prove that the optimized memory allocation parameter is close to its theoretical value, and indicate that our proposed mechanisms have a fast and scalable performance.Ultimately, we compare the overall performance of our techniques integrated in the framework of BC-BSP system, with that of other famous iterative processing systems. The experimental results conducted with the pagerank algorithm prove that our implementation is of high performance and is up to 10x faster when compared to Haloop system.
Keywords/Search Tags:BSP, iterative computation on large graphs, boundary vertex replication, disk-based processing, efficient index
PDF Full Text Request
Related items