
An Improved Algorithm Of Page Rank Based On Various Factors

Posted on: 2020-01-31  Degree: Master  Type: Thesis
Country: China  Candidate: T W Lin  Full Text: PDF
GTID: 2428330578964615  Subject: Agriculture
Abstract/Summary:
Search engines have become one of the most common ways for people to find information, and the most critical technology in a search engine is its webpage ranking algorithm. Currently, the most widely used and successful ranking algorithm is PageRank, which is based on link structure. Its successful use in search engines shows that it is efficient and feasible, but pure link analysis leads to topic drift and ignores user personalization. This thesis therefore proposes THPR, an improved PageRank algorithm based on topic relevance and user search history. To address topic drift and the neglect of user personalization, it makes the following two improvements to PageRank: (1) TextRank is used to parse the content of each retrieved webpage and extract its keywords. Combining Word2Vec and the vector space model (VSM), the similarity between the search words and the webpage keywords is calculated to determine how relevant the page is to the query, and weight is assigned on that basis to reduce topic drift. (2) A search history factor is introduced: the user's search words are recorded, and the similarity between the historical search words and the webpage keywords is calculated during the current search, so that weight allocation takes user personalization into account and users are more satisfied with the results. THPR no longer distributes weight equally among links. Instead, it sizes each share according to these two factors, which to some extent makes up for the defects of the original algorithm.

To verify the performance and efficiency of THPR, an experimental system was built. The experimental data was collected with the open-source crawler Heritrix. After analyzing the overall structure of the crawler, the crawling logic of Heritrix was modified so that it crawls only specific content. After de-noising and filtering, the data is stored in the original web database. The websites in the database are then analyzed, the rank of each page is calculated, and an index is built. A search keyword is entered in the input interface and the corresponding pages are returned. The experimental results show that, compared with PageRank, THPR improves user satisfaction and raises the comprehensive evaluation rate by about 6%. THPR also has advantages over the improved algorithm based on anchor text, the improved algorithm based on time weight, and the improved algorithm based on user behavior.
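
The weighting idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's THPR implementation: the abstract gives no formula, so the linear mix `alpha`, the function names `cosine` and `weighted_pagerank`, and the use of plain bag-of-words cosine similarity (in place of the Word2Vec and VSM combination) are all assumptions made for illustration. The sketch only shows how a page's rank could be split among its out-links in proportion to a relevance score rather than equally.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term lists (plain bag-of-words VSM)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def weighted_pagerank(links, keywords, query, history,
                      alpha=0.7, d=0.85, iters=50):
    """PageRank variant whose out-link shares follow a relevance score.

    links:    page -> list of pages it links to (every target must be a key)
    keywords: page -> list of extracted keywords (e.g. from TextRank)
    query:    current search words
    history:  words from the user's earlier searches
    alpha:    hypothetical mix between topic and history similarity
    """
    pages = list(links)
    n = len(pages)
    # Hypothetical score: mix of topic similarity and history similarity.
    # The small constant keeps every share strictly positive.
    score = {
        p: alpha * cosine(keywords[p], query)
        + (1 - alpha) * cosine(keywords[p], history)
        + 1e-6
        for p in pages
    }
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p in pages:
            outs = links[p]
            if not outs:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += d * rank[p] / n
                continue
            total = sum(score[q] for q in outs)
            for q in outs:
                # Share is proportional to the target's score, not 1/len(outs).
                new[q] += d * rank[p] * score[q] / total
        rank = new
    return rank


# Toy graph: page "a" links to a relevant page "b" and an unrelated page "c".
links = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
keywords = {
    "a": ["index"],
    "b": ["pagerank", "search"],
    "c": ["cooking", "recipe"],
}
ranks = weighted_pagerank(links, keywords,
                          query=["pagerank"], history=["search", "engine"])
```

In this toy graph, standard PageRank would give "b" and "c" equal rank because "a" links to both. Here "b" ends up ranked above "c" because its keywords match both the query and the search history.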
Keywords/Search Tags:PageRank, THPR, topic-relative, search history