Font Size: a A A

Design And Implementation Of BSP Model Oriented Large Graph Data Partitioning Algorithms

Posted on:2014-04-03Degree:MasterType:Thesis
Country:ChinaCandidate:S ZhouFull Text:PDF
GTID:2308330482952811Subject:Computer technology
Abstract/Summary:PDF Full Text Request
With the development of society, a variety of graph data sets, such as social networks like microblog, Twitter, Renren, etc., communication networks and so on, become bigger and bigger. Traditional graph processing tools are difficult to accomplish these computing based on large data sets, so it is urgent to develop a new processing system for the calculation of the mass graph data. The large-scale graph data processing system, Pregel, brought up by Google based on the BSP model provides a guideline for designing and developing a large-scale graph data processing system, at the same time, the well-known cloud computing provides technical supporting. Graph partition for large-scale graph processing system based on the BSP model is still a problem that cannot be avoided. Especially in the cloud computing environment, it more needs to partition the input data into many partitions, and then put them to the computing nodes in the cluster. However, the traditional graph partitioning technology requires multiple iterations, the time complexity is too high, and the result doesn’t save the map information from the vertexes to the partitions, it does not apply to the graph data partition based on the BSP model. So, how to implement a good and fast partition algorithm is a very big challenge.Therefore, in the previous work, our project team developed a graph processing system that can process large-scale graph data, called BC-BSP. It is based on the BSP model, using the distributed parallel processing design ideas in the Hadoop framework. In this paper, we design and implement the data partition algorithm used in the data partition module. The main contributions are as follows:(1) Put forward a BSP job running cost model. Using this cost model, we analyze which factor determines a job’s time overhead. (2) Design and implement three algorithms:First, a balanced partition algorithm based on sampling. Sample a certain number of records during the time of reading data from HDFS or databases, then establish a histogram according to these sampling. When reading data, we put the data falling on the corresponding histogram to the corresponding partition, In this way, we can approximately ensure the balance of the number of vertexes in each partition. This algorithm can put all the webpages from the same website together. Second, a balanced Hash partition algorithm based on out-degree, called BHP (Balance Hash Partition). The concept of virtual bucket is introduced in this algorithm. Then these virtual buckets are reorganized into desired partitions by using a greedy algorithm. In this way, the algorithm can ensure the balance of each partition. At the same time, the data localization strategy makes the data on the split locate on the corresponding node as much as possible. So, the overhead of the data migration during the data loading can be reduced. Third, an edge cluster BHP algorithm, called ECBHP (Edge Cluster BHP). The algorithm uses an EC heuristic rule to evaluate the topology density from the virtual buckets to each partition, and then we reorganize the buckets to partitions based on it. In this way it can ensure the inner-partition has a higher coupling degree, while inter-partition has a lower coupling degree.Applying the three algorithms proposed by this paper into the BC-BSP system, we take a series of experiments. They show that all the three algorithms complete the demand of the graph partition module in the BC-BSP system, and they have good scalability and stable performance. Experiments also show that the Sample algorithm achieves a more balanced on vertexes’number in each partition than the Hash algorithm; the job runtime improves more than 10% than the Hash algorithm. BHP and ECBHP algorithms accomplish the edge balance. At the same time, they reduce the inter-edges in each partition. The running time of the two algorithms is 30% better than the Hash algorithm.
Keywords/Search Tags:BSP, graph partition, distributed system, sampling, load balance
PDF Full Text Request
Related items