Research On Distributed SimRank Algorithms And Improvement Strategies

Posted on:2015-06-19

Degree:Master

Type:Thesis

Country:China

Candidate:H Liu

GTID:2348330482457025

Subject:Computer software and theory

Abstract/Summary:

In many fields, such as entity resolution, social network analysis, and link prediction, measuring the similarity between objects is a basic problem. SimRank is a widely used similarity model that measures the similarity between objects based on the topology of a graph. It rests on a clear intuition: two objects are similar if they are related to similar objects. With the development of the Internet, data has grown explosively, and so has graph data. The main drawback of the naive SimRank algorithm is its high time and space complexity, so it cannot be applied to large graphs without modification. The parallel computation models BSP (Bulk Synchronous Parallel) and MapReduce can handle big-data processing effectively, but applying them to SimRank still has drawbacks. SimRank computation based on BSP sends a large number of messages in each iteration, and vertices that have already converged still receive messages and recompute their similarity. Although the MapReduce model can speed up the computation to some extent, time cost and communication overhead increase sharply as the data grows.

In this thesis, we improve distributed SimRank algorithms based on BSP and MapReduce. Specifically, this thesis makes the following contributions.

(1) We systematically review the state of SimRank research at home and abroad, briefly summarize representative related work, point out its advantages and disadvantages, and analyze the deficiencies of existing research.

(2) We present a distributed SimRank algorithm based on the G² graph under the BSP framework. First, we construct and simplify the G² graph, and on this basis implement the naive distributed SimRank algorithm under the BSP framework. Second, we analyze the efficiency of the naive algorithm. We then present the Delta-SimRank algorithm based on the G² graph under the BSP framework.

(3) We implement some simple distributed SimRank algorithms. We also propose a two-stage distributed SimRank algorithm based on a path index under the MapReduce framework. In the first stage, we propose a path index construction algorithm that generates all k-paths and their probabilities. In the second stage, we compute similarity based on the path index. We propose three optimization strategies: first, set a threshold to filter the path index; second, partition paths into blocks by path tree; third, ensure load balance by allocating blocks to different nodes.

(4) The experimental results verify the feasibility and effectiveness of the key techniques proposed in this thesis. The proposed algorithms outperform the other algorithms in time cost and scalability.
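
To make the intuition behind SimRank concrete, the following is a minimal single-machine sketch of the standard naive SimRank iteration, not the distributed algorithms proposed in the thesis. The function name `simrank`, the decay factor of 0.8, the iteration count, and the toy graph are all assumptions chosen for illustration. The nested loop over every node pair also shows why the naive algorithm becomes expensive on large graphs.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive all-pairs SimRank on a directed graph (illustrative sketch).

    in_neighbors maps every node to the list of nodes that point to it.
    decay and iterations are assumed values, not taken from the thesis.
    """
    nodes = list(in_neighbors)
    # Start from the identity: a node is fully similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0
           for a, b in product(nodes, repeat=2)}

    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # A node without in-neighbors is similar to no other node.
                new_sim[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, then decay.
            total = sum(sim[(x, y)] for x in ia for y in ib)
            new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Toy graph (hypothetical): u points to both a and b.
graph = {"u": [], "a": ["u"], "b": ["u"]}
scores = simrank(graph)
print(scores[("a", "b")])  # a and b share their only in-neighbor u
```

Each iteration visits all node pairs and, for each pair, all pairs of in-neighbors, so both time and memory grow at least quadratically with the number of nodes. This is the cost that the abstract identifies as the obstacle to running naive SimRank on large graphs.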

Keywords/Search Tags:

Distributed SimRank, MapReduce, BSP, G² graph, path index

Related items

1	SimRank Computation On Large Graphs Based On Spark
2	Research On Keyword Search On Graphs Based On MapReduce
3	OBF-Index:A Distributed Multi-Dimensional Index Based On Ordinal Bloom Filter
4	Single-source SimRank Computation And Its Application In Collaborative Filtering
5	Research And Implementation Of MapReduce-based Graph Clustering Algorithm
6	Efficient Indexing And Querying System For Large-Scale Graphs
7	Research On Reachability Preserving Graph Based On MapReduce
8	Research On Data Index Application In The MapReduce Framework
9	A Study And Implementation Of Scalable Data Index Based On MapReduce
10	Distributed Implementation Of A Parallel Tree And Graph Computation Framework