Improvement And Implementation Of PageRank Algorithm Based On Distributed Computing

Posted on: 2021-10-29
Degree: Master
Type: Thesis
Country: China
Candidate: S Yin
Full Text: PDF
GTID: 2518306470486394
Subject: Computer Science and Technology
Abstract/Summary:
With the progress of society and the rapid iteration of information technology, the volume of information on the Internet keeps growing and covers more and more fields. Search engines have become the main way Chinese Internet users obtain information online. The core technology for presenting the required information to users quickly and effectively is the web page ranking algorithm, the most classic of which is PageRank. However, the traditional PageRank algorithm relies purely on link analysis and therefore has several shortcomings, such as topic drift, bias toward older pages, and equal distribution of weight across a page's outgoing links. In addition, improving the computational efficiency of PageRank on massive data is an urgent problem. To address these shortcomings and improve computational efficiency, the main research contents of this thesis are as follows:

(1) The thesis analyzes the cause of topic drift in PageRank and proposes using the semantic similarity between web pages to reduce it. HowNet is used as the semantic dictionary to improve the semantic similarity algorithm. The improved algorithm is applied to word sense disambiguation, semantic compression of text, and text feature extraction, and a text similarity algorithm based on semantic feature extraction is proposed.

(2) The relative topic similarity and relative time factor of web pages are calculated from the semantic similarity and publishing time of the pages, and these are used to compute a weight for each linked page. The equal weight distribution of PageRank is replaced by a distribution proportional to the weights of the linked pages, yielding a PageRank algorithm based on semantic similarity.

(3) A distributed computing experiment platform is designed and implemented, using the Nutch plugin system for secondary development. To address the problems that arise when PageRank is implemented on a distributed computing platform, a parallel PageRank algorithm based on subgraph division is proposed. It moves more of the computation into the map stage of the MapReduce process, which reduces the data transmitted over the network and improves the computational efficiency of PageRank.

Finally, the proposed algorithms are validated on crawled data. The experimental results show that the improved PageRank algorithm has a clear advantage in page ranking over both the traditional PageRank algorithm and the VSM-based PageRank algorithm, and that the parallel PageRank algorithm based on subgraph division is clearly more efficient than the traditional parallel PageRank algorithm.
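The change from equal to weight-proportional rank distribution described in item (2) can be illustrated with a minimal sketch. The abstract does not give the thesis's formula for combining relative topic similarity and relative time factor, so the sketch simply takes a precomputed, non-negative weight per link as input. The function name `weighted_pagerank`, the damping factor 0.85, the fixed iteration count, and the uniform handling of pages without outgoing links are all assumptions made for illustration, not the author's method.

```python
from typing import Dict, Hashable

# links[source][target] = non-negative link weight (hypothetical input that
# stands in for the thesis's topic-similarity / time-factor weight).
Graph = Dict[Hashable, Dict[Hashable, float]]


def weighted_pagerank(links: Graph, damping: float = 0.85,
                      iterations: int = 50) -> Dict[Hashable, float]:
    """Power-iteration PageRank in which each page splits its rank among its
    outgoing links in proportion to the link weights, not equally."""
    # Collect every page that appears as a source or as a target.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page starts each round with the random-jump share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        dangling_mass = 0.0
        for source in nodes:
            targets = links.get(source, {})
            total_weight = sum(targets.values())
            if total_weight <= 0:
                # Assumption: pages with no usable out-links spread their
                # rank uniformly over all pages.
                dangling_mass += rank[source]
                continue
            for target, weight in targets.items():
                share = weight / total_weight  # proportional, not 1/outdegree
                new_rank[target] += damping * rank[source] * share
        for node in nodes:
            new_rank[node] += damping * dangling_mass / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Page A links to B with a high weight and to C with a low weight.
    example = {"A": {"B": 3.0, "C": 1.0}, "B": {"A": 1.0}, "C": {"A": 1.0}}
    for page, score in sorted(weighted_pagerank(example).items()):
        print(page, round(score, 4))
```

With equal weights on every link the sketch reduces to the traditional equal-split PageRank, so the weights are the only place where semantic similarity or publishing time would enter the computation.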
Keywords/Search Tags: PageRank, Distributed computing, Semantic similarity, HowNet, Subgraph division