Font Size: a A A

Top-k SimRank Algorithm Optimization And Its Application In Scientific Literature Retrieval

Posted on:2017-09-07Degree:MasterType:Thesis
Country:ChinaCandidate:R Q LiFull Text:PDF
GTID:2428330569998550Subject:Management Science and Engineering
Abstract/Summary:PDF Full Text Request
Similar scientific literature retrieval occupies a very important proportion in academic research.With the exponential growth of scientific literature data,the traditional similarity retrieval method based on text content encounters the bottleneck of accuracy and speed in practical application.In recent years,the focus of this field has been shifted to similarity retrieval based on the relationship between scientific literatures,especially the technologies based on the graph structure.At present,an effective similarity retrieval method is to first construct a document network,and then further analysis and evaluation are carried out using a certain node pair similarity evaluation index in the network.SimRank is one of the most widely used node pair similarity evaluation index in the network.This paper studies the optimization of SimRank algorithm and its application in scientific literature retrieval.The basic idea of SimRank can be summarized as the more similar are the neighbors of two nodes,the more similar are the two nodes;literally,the similarity between two nodes is determined by the similarity between their neighbors.However,the current computation methods of SimRank cost much in time and space and the performance is not good in practice.The current mainstream algorithms are mainly based on matrix iterative form of SimRank.The methods take the entire network as input and iteratively compute similarity score.In each iteration,it needs to update the similarity score of all node pairs.On the one hand,it costs huge time and space.Its scalability is not good.On the other hand,users only concern with node pairs of the high similarity score in practical application,but the calculation result of this kind of methods is the similarity score of all node pairs in the whole network.With too much unnecessary answers,the utilization rate of the result is low.To solve this problem,this paper designs a Top-k SimRank algorithm framework based on the random walk model of SimRank,which aims to eliminate the node pairs with low similarity score in time and calculate the ones with high similarity score quickly and accurately.The practical data is used to test its efficiency in application.Firstly,this paper designs the incremental algorithm of SimRank based on the random walk model of SimRank.The incremental algorithm is on the basis of the equivalent relationship between SimRank and the first meeting probability of two random walks start from two nodes and walks randomly.The advantage of the incremental algorithm is that it does not need to recalculate the whole network in each iteration.Instead,it only adds the similarity score calculated in this iteration to the result of last iteration so that the result is non-decreasing.This is suitable for the determination of threshold and non-candidate nodes.Next,on the basis of the incremental algorithm,the iterative batch pruning framework of SimRank is designed.This framework can quickly eliminate the node pairs with low similarity to save time and space cost.Under this framework,we further define the concept of the “super-node” and design the upper-bound that based on the super-node as the basis of the batch pruning point.Compared with the existing upper-bound that based on the summation of geometric sequences,the upper-bound based on the super-node considers the adjacency relationship of the nodes in the network and can provides upper-bound with higher accuracy yet costs little space.We also design the method to use the upper-bound based on the super-node in the iterative batch pruning framework.It saves the time cost efficiently by early eliminating some non-candidate nodes even when the upper-bound based on the upper point is not completely calculated.Finally,considering the data in real world is more than huge,this paper further designs the optimization strategy of the algorithm for large networks by combining the upper-bounds based on the summation of geometric sequences and the upper-bound based on the super-nodes.In order to further evaluate the practical significance and efficiency of the proposed technology,this paper carrys out applied research on real data.We shift the existing scientific literature data to a network.Specifically,we transfer one document to one node and the citing relationship between two documents to one edge in network.The results show that the time and space costs of the proposed technique are in line with the practical application requirements.
Keywords/Search Tags:Top-k similarity join, Iterative batch pruning, Upper-bound, Scientific literature retrieval
PDF Full Text Request
Related items