SimRank Computation On Large Graphs Based On Spark

Posted on:2019-03-13

Degree:Master

Type:Thesis

Country:China

Candidate:X K Gao

Full Text:PDF

GTID:2428330545976726

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

In graph-based data analysis applications,how to measure similarity between vertices in a graph is a very important research topic which has a wide range of applications in many fields.Among all the measures,SimRank is the one with growing popularity in recent years.Compared with others,SimRank is able to capture the topology of the graph.However,computation of SimRank is costly in both time and space,making traditional computing methods failing to handle graph data with ever-growing size.On the other hand,as today's popular big data computing engine,Apache Spark significantly simplifies user's programming logic with the interface it provides when developing distributed applications.Therefore,to design an efficient distributed algorithm to compute SimRank similarity for vertex pairs in large-scale graphs based on Spark or other systems is of great significance.This paper addresses the above problems and our main contributions are summarized as follows:1.To compute the single-source SimRank similarity,we proposed and imple-mented an algorithm based on the random-walk model.The algorithm reduces time complexity by producing less random walks,reduces space complexity by compressing the representation of intermediate data,achieves high parallelism by distributing the matching process of random walks to the whole cluster.Experiments on real-world data show that the number of random walks has a decrease of two or more orders of magnitude,and the running time achieves near-linear scalability.2.We proposed a distributed multi-level partitioning algorithm based on mod-ularity maximization for large graphs.In the processing of partitioning,the local dense subgraphs are well preserved within blocks.Load balance of ver-tices is carefully maintained with respect to specific constraints,and efforts are made to minimize the edge cut weights between blocks.Experiments show that,the algorithm achieves good partition quality comparable to METIS,and good scalability.3.With the help of the above graph partitioning scheme,we proposed an divide-and-conquer algorithm to compute the all-pair SimRank similarity.The al-gorithm partitions the graph into blocks,obtains similarity of each node-pair within blocks,then treats each block as a new vertex forming a coarsening graph to get the similarity between each block.A global similarity for all vertex pairs is computed based on these two similarities.Experiments show that the effectiveness of our algorithm is comparable to SimRank.while the computing efficiency is increased by several to tens of times(3-16x).

Keywords/Search Tags:

Graph Data, SimRank Similarity, Distributed Computing, Spark

PDF Full Text Request

Related items

1	Research On A Distributed Graph Data Process Mechanism Based On Spark
2	Research On Distributed Simrank Algorithms And Improvement Strategies
3	Single-source SimRank Computating And Its Application In Collaborative Filtering
4	The Research Of Graph Computing Framework Supporting Spatial And Temporal Data Management Based On Spark
5	Research On Balanced Graph Partitioning Based On Vertex Cut And Its Implementation On Spark
6	Graph Reachability Distributed Computing And Application Based On Spark
7	A System For Distributed MD Data Analysis Based On Spark
8	Efficient Algorithm For Mining Dense Subgraphs In Uncertain Graph
9	Research And Implementation Of Similarity Join For Big Data
10	Design And Implementation Of Distributed Graph Computing Engine