Font Size: a A A

Research On Distributed Graph Computing Performance Optimization For Natural Graphs

Posted on:2017-03-31Degree:MasterType:Thesis
Country:ChinaCandidate:C F ZhangFull Text:PDF
GTID:2428330569999073Subject:Software engineering
Abstract/Summary:PDF Full Text Request
With the rapid development of web search technology,social network,biological information science and the implementation of human brain plan,graph theory knowledge and algorithm have been widely applied and developed.The scale of data in the graph field is increasing at an unprecedented speed.With the rapid development of cloud computing technology,distributed graph computing has become a research hotspot in academia and industry.It has become a key problem to optimize the performance of the distributed graph computing platform by improving the data partitioning efficiency,reducing the memory storage cost and reducing the communication between machines.GraphA is based on the Spark distributed platform to achieve a simple graph distributed processing platform,this paper studies the GraphA data partitioning and memory storage management efficiency of the distributed graph computing performance optimization technology,focusing on adaptive partitioning algorithm and the storage method based on ART-index.The main contents and contributions of this paper include:Firstly,the basic principles and characteristics of distributed graph computing are deeply studied and analyzed,including three basic distributed computing frameworks of MapReduce,BSP and GAS,which are the basic support technologies of this paper.In addition,the paper summarizes the merits and application values of previous researches,and points out their respective limitations in combination with the power-law distribution characteristics of natural graphs and their application.Secondly,an adaptive graph partitioning algorithm is proposed to solve the problem of data partitioning.Both social network data and web page data are power-law distribution,there are high-dimensional vertices and low-dimensional vertices in the data.Compared with the existing algorithm of vertex-cut,edge-cut and hybrid-cut,adaptive partitioning algorithm does not distinguish between high-dimensional vertex and lowdimensional vertex,provides a unified partitioning scheme,through the design of a series of hash function,with the use of global distributed partitioning table which records all vertices partitioning,realizing efficient partitioning of graph data.The experimental results show that the adaptive partitioning algorithm can balance the machine load and keep the original locality of the input graph data,reduce the traffic between machines during each iteration,and avoid the global shuffle in the data partitioning stage.Which satisfies the demand of performance optimization in data partitioning problem.Thirdly,a graph storage model based on ART index is proposed,aiming at the efficiency of memory management.Graph algorithms have the characteristics of incremental update,not every round of iteration to update all the vertices in the graph or all the properties value of the edge.Spark RDD(Resilient Distributed Dataset)has the immutable feature.Originally designed to ensure data consistency,Spark RDD updates the data incrementally which will do a copy of the original data,so a new iteration of RDD is expensive.GraphA introduces a fine-grained updated RDD storage structure,which adopts ART index to ensure the consistency of data and update the graph data in a fine-grained way.It speeds up the query update and reduces the memory overhead.In this paper,large-scale data partitioning algorithm and the optimization of data storage technology are explored.The research results have certain theoretical value and guiding significance for the performance improvement of distributed graph computing platform.Finally,the performance of GraphA is verified by a large number of experiments.
Keywords/Search Tags:Graph Computing, Data partitioning, Storage index, ART
PDF Full Text Request
Related items