Font Size: a A A

SimRank Computation On Large Graphs Based On Spark

Posted on:2019-03-13Degree:MasterType:Thesis
Country:ChinaCandidate:X K GaoFull Text:PDF
GTID:2428330545976726Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
In graph-based data analysis applications,how to measure similarity between vertices in a graph is a very important research topic which has a wide range of applications in many fields.Among all the measures,SimRank is the one with growing popularity in recent years.Compared with others,SimRank is able to capture the topology of the graph.However,computation of SimRank is costly in both time and space,making traditional computing methods failing to handle graph data with ever-growing size.On the other hand,as today's popular big data computing engine,Apache Spark significantly simplifies user's programming logic with the interface it provides when developing distributed applications.Therefore,to design an efficient distributed algorithm to compute SimRank similarity for vertex pairs in large-scale graphs based on Spark or other systems is of great significance.This paper addresses the above problems and our main contributions are summarized as follows:1.To compute the single-source SimRank similarity,we proposed and imple-mented an algorithm based on the random-walk model.The algorithm reduces time complexity by producing less random walks,reduces space complexity by compressing the representation of intermediate data,achieves high parallelism by distributing the matching process of random walks to the whole cluster.Experiments on real-world data show that the number of random walks has a decrease of two or more orders of magnitude,and the running time achieves near-linear scalability.2.We proposed a distributed multi-level partitioning algorithm based on mod-ularity maximization for large graphs.In the processing of partitioning,the local dense subgraphs are well preserved within blocks.Load balance of ver-tices is carefully maintained with respect to specific constraints,and efforts are made to minimize the edge cut weights between blocks.Experiments show that,the algorithm achieves good partition quality comparable to METIS,and good scalability.3.With the help of the above graph partitioning scheme,we proposed an divide-and-conquer algorithm to compute the all-pair SimRank similarity.The al-gorithm partitions the graph into blocks,obtains similarity of each node-pair within blocks,then treats each block as a new vertex forming a coarsening graph to get the similarity between each block.A global similarity for all vertex pairs is computed based on these two similarities.Experiments show that the effectiveness of our algorithm is comparable to SimRank.while the computing efficiency is increased by several to tens of times(3-16x).
Keywords/Search Tags:Graph Data, SimRank Similarity, Distributed Computing, Spark
PDF Full Text Request
Related items