Font Size: a A A

Research On Distributed Simrank Algorithms And Improvement Strategies

Posted on:2015-06-19Degree:MasterType:Thesis
Country:ChinaCandidate:H LiuFull Text:PDF
GTID:2348330482457025Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In many fields, such as entity resolution, social network analysis and link prediction, mesuring similarity between objects is a basic problem. SimRank is a widely used model for computing similarity, it measures similarity between objects based on graph's topology. It's based on a clear human intuition: two objects are similar if they are related to similary objects.With the development of Internet, the data grows explosively, so does graph data. The main drawback of the naive SimRank algorithm is its computation and space complexity, it cann't be applied to large graphs without changes. Nowadays, the parallel computation model of BSP(Bulk Synchronization Parallel) and MapReduce can effectively solve big data processing problem, however, if applied to compute SimRank, there still exists som drawback: The computation for SimRank based on BSP will send a lot of messages in each iteration, and and for those already convergent vertexes, they still receive messages and recompute similarity. And although the MapReduce model can speed up computation in some extent, when faced with larger data, time complexity and communication will increase sharply.In this thesis, we improved distributed SimRank algorithm based on BSP and MapReduce. Specifically, this thesis has made the following contributions.(1) We systematically introduce the research status at home and abroad of SimRank, briefly summarize the representative related work, and point out their advantages and disadvantages, then analyze the deficiency of present research.(2) A distributed SimRank algorithm is presented based on G2 graph under BSP framework. Firstly, we construct and simplify the G2 graph. Based on it we implement the naive distributed SimRank algorithm under BSP framework. Secondly, we analyzed the efficicency of the naive algorithm. And then we present the Delta-SimRank algorithm based on G2 graph under BSP framework.(3) we implement some simple distributed SimRank algorithms. Also, we propose a two stage distributed SimRank algorithm based on the path index under MapReduce framework. In the first stage, we proposed an algorithm of path index construction, which can generate all k-paths and probability. In the second stage, we compute similarity based on the path index. We propose three optimization strategies:first, set threshold to filter path index; second, block paths by path tree; third, ensure load balance by allocating blocks to different nodes.(4) The experimental results verified the feasibility and the effectiveness of key technique proposed in this paper. The algorithms proposed in this thesis are better than other algorithms in timecost and scalability.
Keywords/Search Tags:Distributed SimRank, MapReduce, BSP, G~2 graph, path index
PDF Full Text Request
Related items