Font Size: a A A

Distributed Algorithm For Pairwise Vertex Similarity Calculation On Big Graphs

Posted on:2019-02-19Degree:MasterType:Thesis
Country:ChinaCandidate:S WangFull Text:PDF
GTID:2370330545476726Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
The pairwise vertex similarity calculation,i.e.finding out all pairs of similar vertices with similarity scores greater than or equal to a given threshold on a graph,is a fundamental algorithm for many graph analytic applications,such as user recommendation and link prediction.However,the calculation of vertex similarity scores is a problem with a high computational complexity,especially in the big data era.Therefore,the research of algorithms for this problem is important for both analytic study and actual application.The existing distributed algorithms for pairwise vertex similarity calculation are mainly classified into two categories:the filtering-based algorithms and the all-pair algorithms.However,the existing methods still have many problems.The existing distributed all-pair algorithms are not efficient enough because they shuffle many redundant adjacency sets to achieve high computation parallelism but unfortunately incur high network communication costs.At the same time,all the existing algorithms cannot perform efficiently within the whole threshold range from 0 to 1.The filtering-based algorithms are efficient in the high threshold while the all-pair algorithms perform better in the low threshold.In addition,the shuffle implementation in the cutting-edge data-parallel platforms like Hadoop and Spark is disk-based.Since the large-scale pairwise vertex similarity calculation algorithms shuffle a lot of data,the disk-based shuffle involves frequent spilling-to-file operations and heavy disk IO during the execution,which is expensive and lowers the performance.In order to solve the problems above,in this paper,we propose two algorithms FinNOR-MR and FinNOR-Adaptive.The main contributions can be summarized as follows:1)We propose a novel MapReduce-based all-pair algorithm FinNOR-MR that shuffles adjacency sets on the partition level instead of the traditional vertex level.This technique reduces the data redundancy and network communication costs.2)We propose a novel adaptive algorithm FinNOR-Adaptive that performs well in the threshold range from 0 to 1.For the low threshold,FinNOR-Adaptive uses the all-pair calculation mode.For the high threshold,it uses the filtering-based calculation mode.We propose a cost estimation model in FinNOR-Adaptive to dynamically switch the execution mode during computation to achieve good performance in a wide threshold range.3)We implement FinNOR-Adaptive that adopts the MapReduce framework and the distributed in-memory database.The shuffle operations in MapReduce is replaced with database queries and the expensive disk-based shuffle operation is bypassed in this way.Moreover,the implementation introduces the cache mechanism to reduce the network communication costs,which has the similar effect as the partition-level shuffle optimization.These two improvements make FinNOR-Adaptive more efficient than all the existing algorithms in any threshold.4)The extensive experimental results are coincident with the theoretical analysis.The results demonstrate that FinNOR-MR is the most efficient all-pair algorithm.Compared with the state-of-the-art algorithms,,FinNOR-MR achieves up to 3.5×speedups and its network communication costs are 5.7%of the costs of SBM.The results also demonstrate that FinNOR-Adaptive can efficiently calculate similar vertices in the whole threshold.The performance of FinNOR-Adaptive is better than the state-of-the-art algorithms in the low threshold and comparable to the cutting-edge filtering-based algorithms in the high threshold.Both of the proposed algorithms have the near-linear data and node scalability.
Keywords/Search Tags:Distributed algorithms, threshold, graph algorithms, vertex similarity, common neighbors
PDF Full Text Request
Related items