Research on PageRank Algorithm Optimization Based on Forestry Theme

Posted on: 2018-11-15
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Li
Full Text: PDF
GTID: 2333330566950400
Subject: Forestry Information Engineering
Abstract/Summary:
With the rapid development of science and technology, forestry informatization in China has entered the stage of "intelligent forestry". In the context of big data, the volume of network data is growing exponentially, and people need more efficient and accurate ways to obtain forestry information from the web. However, the results returned by traditional ranking methods are of low value: they contain much irrelevant information, users cannot grasp the key points, and the methods therefore cannot meet the demand for information retrieval in forestry-related fields. To raise the ranking of web pages that are highly relevant to the theme, so that pages more in line with the forestry theme are presented to the user first, this thesis improves the traditional web page ranking algorithm. After an in-depth study of web page ranking algorithms, the PageRank algorithm is taken as the research foundation, and a crawler is used to collect a large amount of information and establish the link relationships between the web pages. Because the traditional ranking algorithm suffers from topic drift and, during retrieval, does not take the importance of the query terms into account, the original PageRank algorithm is combined with text topic weights, and a ranking algorithm based on forestry theme weights, FT-PageRank, is proposed. A manually classified training set is processed to obtain an SVM classification model for forestry topics. The relationship between a newly acquired text vector and this classification model is used to judge the text's similarity to the forestry theme and to calculate its forestry theme weight. The resulting weight vector is then used as a parameter in the iterative PageRank calculation. Because the web link structure is too large to be computed quickly, the improved algorithm is deployed on the MapReduce parallel computing framework, which parallelizes it and improves data processing efficiency. In addition, during acquisition, information is extracted according to the tags of each web page, following the semi-structured nature of Web documents. A keyword expresses the theme to a different degree depending on where it appears in the page text. A method for calculating word position weights is therefore introduced, and these weights are incorporated into the FT-PageRank score. Pages are then ranked according to this score, so that the ranking results better match the user's retrieval needs. The experimental results show that the improved method represents forestry topics well: pages with "a high degree of access" and a clear theme receive a higher rank, and the overall topical accuracy of the ranking is improved. With parallelization through the MapReduce framework, execution efficiency improves markedly as the amount of data increases. Finally, a system prototype is designed, indicating that the algorithm has practical application value.
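
The abstract does not give the FT-PageRank update formula, so the following is only a minimal sketch of one common way to fold per-page topic weights into the PageRank iteration, not the thesis's actual method. It assumes that each page already has a non-negative forestry theme weight (here the hypothetical dictionary `topic_weight`, standing in for the SVM-derived score), that a page's rank is split among its out-links in proportion to the targets' weights, and that the random jump is also biased toward high-weight pages. The function name, the damping value 0.85, and the sample pages are all illustrative assumptions.

```python
def topic_weighted_pagerank(links, topic_weight, damping=0.85, iterations=50):
    """Sketch of a topic-weighted PageRank (assumed formulation, not FT-PageRank itself).

    links:        dict mapping a page to the list of pages it links to
    topic_weight: dict mapping a page to a non-negative theme weight (hypothetical input)
    """
    # Collect every page that appears as a source or a target.
    pages = sorted(set(links) | {v for outs in links.values() for v in outs})
    total_weight = sum(topic_weight.get(p, 0.0) for p in pages)
    if total_weight <= 0:
        raise ValueError("at least one page needs a positive topic weight")

    # Random jumps land on pages in proportion to their theme weight.
    teleport = {p: topic_weight.get(p, 0.0) / total_weight for p in pages}
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * teleport[p] for p in pages}
        for u in pages:
            outs = links.get(u, [])
            out_weight = sum(topic_weight.get(v, 0.0) for v in outs)
            if out_weight > 0:
                # Split u's rank among its targets by their theme weight.
                for v in outs:
                    share = topic_weight.get(v, 0.0) / out_weight
                    new_rank[v] += damping * rank[u] * share
            else:
                # No usable out-links: spread u's rank by the teleport vector.
                for p in pages:
                    new_rank[p] += damping * rank[u] * teleport[p]
        rank = new_rank
    return rank


# Illustrative data only: three made-up pages and made-up theme weights.
links = {
    "forest_news": ["timber", "sports"],
    "timber": ["forest_news"],
    "sports": ["forest_news"],
}
topic_weight = {"forest_news": 0.9, "timber": 0.8, "sports": 0.1}
ranks = topic_weighted_pagerank(links, topic_weight)
```

In this sketch the inner loop over `u` only adds independent contributions to `new_rank`, which is the part that a map step (emit contributions per link) and a reduce step (sum contributions per page) could take over in a MapReduce deployment such as the one the abstract describes.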
Keywords/Search Tags: Web ranking, PageRank algorithm, Forestry topic weight, Word location