Font Size: a A A

Research On Large Scale Graph Analytic System For Supporting Fast Random Walks

Posted on:2022-06-12Degree:DoctorType:Dissertation
Country:ChinaCandidate:R WangFull Text:PDF
GTID:1488306323963719Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
With the advent of the era of big data,the scale of data has grown rapidly,and meanwhile the interaction between data has become increasingly complex.Mining the rich value of data and its correlation can provide a lot of valueable information for our daily life.Graph can naturally express the associated relationship between entities in the real world,and graph analysis can fully explore the hidden information in these associated relations.Therefore,graphs and graph analysis are widely used in many ap-plications.However,the increasing scale of the graph data brings challenges to graph analysis.On the one hand,the traversal analysis on the large graph is difficult to run efficiently,so the sampling method based on random walk can achieve reliable approxi-mate calculation.Due to its high computational efficiency and solid theoretical support,random walk is often used in some important graph analysis,sorting and embedded al-gorithms.On the other hand,large scale graph data can not fit the memory of a single machine,so many disk-based single-machine graph processing systems were proposed to store the large scale graph data in disk,and processed graph analysis via disk I/Os.However,existing graph processing systems have some limitations when support-ing random walk based applications:(1)At the level of graph data storage,existing graph processing systems usually adopt an iterative model,which successively and it-eratively loads graph data blocks from disk into memory for computational analysis,thereby alleviating performance bottlenecks caused by heavy random I/O to the disk.However,this iteration-based model shows very low I/O efficiency when supporting random walks.Thus,the efficiency and scalability of the random walks on the graph are limited.(2)At the level of random walk calculation,the iteration-based synchronous compute model used in the existing work,sacrificed the computational efficiency of the random walk in order to ensure the synchronism between all walks.In addition,the storage management of random walk data used in the existing graph processing systems usually adopt an edge/vertex-centric indexing strategy and dynamic arrays based storage method.These methods not only bring a very large indexing overhead,but also frequent memory re-allocation cost,which resulting to massive memory fragmentations and ad-ditional time overhead.Therefore,the scale of graphs and concurrent random walks that can be processed by the graph processing systems are both limited.(3)At the level of random walk algorithm,random walk based graph sampling algorithms need to col-lect samples after the random walk converges to the stationary distribution,in order to achieve unbiased estimation.However,existing random walk algorithms usually re-quire tens of thousands of walk steps to converge to the stationary distribution,which seriously affects the sampling efficiency.This dissertation focuses on the random walk based applications,and aims for the three levels of problems stated above,respectively studies the random walk friendly graph storage management,random walk computing framework and random walk algo-rithm with fast convergence rate.Finally,we implementation a complete graph analysis system that supporting fast random walks.The main research contents and contributions of this dissertation are state as follows.(1)Random walk friendly graph storage management system.Combining with the characteristics of random walk based applications,we pro-pose a walk-state-aware I/O model.Based on this I/O model,we design and implement an efficient graph data storage management mechanism SAStore for supporting large scale concurrent random walks.According to the number of random walks,the walk steps and the distribution of the walks in the graph,SAStore adopts different graph par-titioning,loading and caching strategies to maximize the I/O efficiency.In addition,to solve the problem of slow update efficiency in the scenario of dynamic graph process-ing,SAStore further proposes a new blocked log method based on CSR to enable fast graph updates.Experimental evaluations on real world graphs show that,compared with the state-of-the-art random walk graph processing system Drunkardmob,SAStore can improve the I/O efficiency by 2 times to 4 times on average.In the scenario of dynamic graph,blocked CSR with edge log management significantly improve the efficiency of dynamic graph updates compared with the basic CSR storage format.(2)Fast random walk computing framework.In order to realize a graph computing framework supporting for fast random walks,we firstly propose an asynchronous random walk updating strategy to speed up the walk updating rate.Then we propose a block-centric walk indexing strategy,and use a fixed-length buffer to store all walks of each subgraph,thus to avoid frequent memory reallo-cation caused by a large number of dynamic arrays.Finally,we implement a prototype system GraphWalker that supports fast random walks.The experiment results show that GraphWalker can efficiently handle very large graphs with billions of vertices and hundreds of billions of edges,and GraphWalker can also process massive con-current random walks,i.e.,tens of billions of random walks with thousands of steps.GraphWalker can achieve more than an order of magnitude speedup compared with the random-walkspecific system DrunkardMob,as well as two stateof-the-art single-machine graph systems,Graphene and GraFSoft.Furthermore,GraphWalker is more resource friendly as its performance is even comparable with the state-of-the-art dis-tributed random walk system KnightKing running on a cluster of eight machines.(3)Random walk algorithm with fast convergence rate.A very important reason for the the slow convergence rate of random walk algo-rithm is that random walk may often be stuck in a local subgraph,and may take many steps to walk out,thus retard the random walk to fast explore the whole graph and slow down the convergence rate.Therefore,we design a non-backtracking commom neigh-bor aware random walk algorithm NB-CNARW,to use the historical information of the random walk and the domain information of each candidate vertex to realize the weighted random walk,thus to accelerate the convergence rate of the random walk and improve the sampling efficiency of the random walk.Experimental results show that the proposed NB-CNARW algorithm is better than the latest random walk algorithm.It can significantly improve the convergence speed of the random walk,and the number of steps required for convergence can be reduced by 29.2%at most.When using NB-CNARW algorithm for unbiased estimation based on graph sampling,given the same estimation accuracy,the query cost of sampling can be reduced by up to 25.4%.
Keywords/Search Tags:Random Walk, Graph Sampling, Graph Process System, Storage Management, I/O Schedule, Compute Model
PDF Full Text Request
Related items