
Optimizing Page Ranking Based On Link Analysis

Posted on: 2013-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: J N Li
Full Text: PDF
GTID: 2248330371483036
Subject: Computer application technology

Abstract/Summary:
Search engines can pick out the most relevant pages from hundreds of millions of web pages, making the World Wide Web far more convenient to use. As the core component of a search engine, the page ranking algorithm determines the order in which search results are presented to users, and its performance directly affects service quality and the user's search experience. Search engines currently face an increasingly serious web spam problem: by exploiting defects in existing ranking algorithms, page creators improve a page's rank through deceptive methods. Such pages are usually of low quality and may even contain false information, seriously degrading the search service. Web spam has become one of the key issues that search engine research needs to address. Against this background, this thesis studies ranking spam together with methods for detecting and suppressing it, and proposes a new ranking algorithm based on an analysis and summary of existing methods. The main work and contributions of this thesis are as follows:

(1) After analyzing the two classes of page ranking algorithms, content-based and link-based, we describe the spamming techniques used against each and analyze in depth the main spam detection methods, including those based on the principle of trust propagation.

(2) We propose a new method to measure similarity between web pages. Based on the inlinks and outlinks of pages, the method draws on the "social circle" concept from social network analysis to detect and suppress web spam. Because exact computation is too time-consuming for the large-scale World Wide Web, the thesis introduces an approximate computation based on a probabilistic counting algorithm, reducing the time and space complexity of the similarity measure from O(n^3 log n) and O(n^2) to O(n^2) and O(n) respectively, where n is the number of nodes.
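The abstract does not specify which probabilistic counter the thesis uses, so as a hedged illustration only, the sketch below uses a Flajolet-Martin counter (a classic probabilistic counting algorithm) to estimate the sizes of link-neighborhood sets in constant space per page, and derives an approximate Jaccard similarity between two pages' inlink sets by inclusion-exclusion on the merged sketches. All names (`FMSketch`, `sketch_similarity`) are hypothetical, not from the thesis.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin bias-correction constant


def _hash(item: str) -> int:
    """Stable 64-bit hash of an item (e.g. a URL)."""
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")


def _rho(x: int) -> int:
    """1-based position of the least significant set bit of x."""
    if x == 0:
        return 64
    return (x & -x).bit_length()


class FMSketch:
    """Flajolet-Martin probabilistic counter with m registers.

    Each sketch uses O(m) space regardless of how many items are added,
    which is what drives the space reduction from O(n^2) to O(n).
    """

    def __init__(self, m: int = 64):
        self.m = m
        self.registers = [0] * m

    def add(self, item: str) -> None:
        h = _hash(item)
        bucket = h % self.m
        self.registers[bucket] = max(self.registers[bucket], _rho(h // self.m))

    def merge(self, other: "FMSketch") -> "FMSketch":
        """Sketch of the union: take the register-wise maximum."""
        out = FMSketch(self.m)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self) -> float:
        """Approximate number of distinct items added."""
        mean_r = sum(self.registers) / self.m
        return (self.m / PHI) * 2 ** mean_r


def sketch_similarity(a: FMSketch, b: FMSketch) -> float:
    """Approximate Jaccard similarity of two item sets from their sketches,
    via inclusion-exclusion: |A ∩ B| ≈ |A| + |B| - |A ∪ B|."""
    union = a.merge(b).estimate()
    if union <= 0:
        return 0.0
    inter = max(0.0, a.estimate() + b.estimate() - union)
    return min(1.0, inter / union)
```

Because two sketches merge in O(m) time, pairwise similarities over the whole graph avoid materializing the full neighbor sets, which is the kind of trade of a small estimation error for a large complexity reduction that the abstract describes.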
Experimental results show that this approximate method significantly reduces computational expense at only a small loss of precision.

(3) We propose the ISA-PR (Inlink Source Analysis based PageRank) algorithm, which ranks web pages based on an analysis of their inlink sources. The basic idea is that existing page ranking and spam detection methods consider only the number of inlinks, while ignoring their sources, an important piece of information for objectively evaluating a page's authority. Unlike genuinely authoritative pages, which have many inlinks from a wide variety of sources, pages whose ranks are inflated by spamming techniques typically lack this source diversity. Based on this observation, we propose a method to judge the diversity of inlink sources and a method to adjust link weights accordingly, and combine them with PageRank to obtain ISA-PR. Experimental results on several benchmark data sets show that ISA-PR performs well at finding high-quality pages and suppressing web spam.
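The abstract gives only the idea behind ISA-PR, not its formulas, so the sketch below is an illustrative assumption: inlink-source diversity is proxied by the fraction of distinct source domains among a page's inlinks, each link's weight is scaled by the diversity of its target, and a standard PageRank power iteration runs over the reweighted graph. The function names and the diversity proxy are hypothetical, not taken from the thesis.

```python
def isa_pagerank(inlinks, domain_of, d=0.85, iters=100):
    """Illustrative ISA-PR-style ranking (hypothetical reconstruction).

    inlinks: dict mapping page -> list of source pages linking to it
    domain_of: dict mapping each source page -> its domain string
    Diversity proxy: distinct source domains / number of inlinks, so a page
    whose inlinks all come from one link farm scores low.
    """
    nodes = set(inlinks) | {s for srcs in inlinks.values() for s in srcs}

    def diversity(p):
        srcs = inlinks.get(p, [])
        if not srcs:
            return 0.0
        return len({domain_of[s] for s in srcs}) / len(srcs)

    div = {p: diversity(p) for p in nodes}

    # Invert inlinks to outlinks for the power iteration.
    out = {p: [] for p in nodes}
    for target, srcs in inlinks.items():
        for s in srcs:
            out[s].append(target)

    n = len(nodes)
    pr = {p: 1.0 / n for p in nodes}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in nodes}
        dangling = 0.0
        for u in nodes:
            targets = out[u]
            if not targets:
                dangling += pr[u]
                continue
            wtot = sum(div[v] for v in targets)
            if wtot == 0:
                # Fall back to uniform spread if all targets have zero diversity.
                for v in targets:
                    new[v] += d * pr[u] / len(targets)
            else:
                # Each source spends its rank in proportion to target diversity.
                for v in targets:
                    new[v] += d * pr[u] * div[v] / wtot
        for p in nodes:
            new[p] += d * dangling / n
        pr = new
    return pr
```

On a toy graph where one page draws inlinks from four distinct domains and another draws the same number from a single link-farm domain, the diversity weighting steers rank toward the former, which is the qualitative behavior the abstract claims for ISA-PR.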
Keywords/Search Tags: Search engine, Page ranking, Web spam, Link analysis, Probabilistic counting algorithm