Font Size: a A A

Design And Implementation Of Data Partitioning In BSP-Based Big Graph Processing System

Posted on:2013-03-24Degree:MasterType:Thesis
Country:ChinaCandidate:F GaoFull Text:PDF
GTID:2268330425497302Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Extracting knowledge by performing computations on graph is becoming increasingly challenge as graphs grow in size. Modern graph datasets are huge. Such as FaceBook, twitter, RenRen, the amount of data is very huge. Traditional graph processing tools are difficult to complete these calculations in such huge data set. It is urgent that a new graph processing system be developed. Google has developed a big graph processing system based on BSP model, called Pregel. This provides a guideline for the design and development of graph processing system. The most popular cloud computing technology provides technical supports for developing such system. However, any processing system can not avoid a problem that already exists that is graph partition problem. Disributed parallel processing in the cloud computing environment data need to be divided into multiple partitions. Data set is divided into multiple partitions and processed parallelly by the computeing nodes in the cluster. How to achieve a good partition remains a difficult and a great challenge.In order to solve the above problems, we borrow the idea from cloud computing programming model Hadoop which is a distributed parallel processing to develope a big graph processing system based on the BSP model that can handle large graph dataset. This thesis mainly discusses the design and implementation of data partitioning module of the system. The main contributions of this thesis are as follows. Firstly, Analysis graph computation features design and implement the data partitioning module of the system by borrowing the design idea from the existing graph process system. Provides complete user interface that users can set flexible. The users can use the default partitioning strategies or customize their own partitioning algorithms by the provided interface. Data partitioning module is working well in the system. It is integrated into the system. Secondly, we implemente three graph data paratitioning algorithms, which are range partitioning algorithm MD5-based hash partitioning algorithm, and balanced and optimized modulo hash algorithm based on virtual paratition. Thirdly, we implements the multiple imput formats support. We compare and analysis the similarity of HDFS and HBase storage design, integrate the input formats of HDFS and HBase and provides a unified interface design. The user can customize their own input format according to their needs. Finally, in order to achieve graph algorithms, wei provide the necessary support components, such as, Multi-threaded data sending based on RPC, ring buffer, global synchronization and optimizer, etc.The experimental results and the practical applications show that the data partitioning module of the big graph processing system has been implemented and reached the design goals of the system. It has good scalability and stability. We compare the performace of the three different data partitioning algorithms from load balance, communication const, time cost by many experiments. It is concoluded that the optimized and balanced hash paratitioning method has better performance. Range partitioning algorithm has the best performance when the data sets hava good local aggregation properties.
Keywords/Search Tags:graph partitioning, distributed system, balanced partitioning, BSP, RPC
PDF Full Text Request
Related items