Research On Web Page Sorting Algorithm Of Web Structure Mining

Posted on: 2012-10-04
Degree: Master
Type: Thesis
Country: China
Candidate: C Yang
Full Text: PDF
GTID: 2248330395455287
Subject: Computer software and theory
Abstract/Summary:
In recent years, with the rapid development of the Internet, the volume of information available online has grown exponentially, which has greatly affected how people live and study. It is generally very difficult for Internet users to find exactly the information they need without specific tools. Against this background, search engine technology was born. Web page sorting algorithms, which rank pages by authority values derived from the links between Web pages, are a key technology of search engines. As is well known, Google's PageRank algorithm is a classic and efficient Web page sorting algorithm; its main idea is to transfer authority values along the links between pages. In terms of its mathematical model, the PageRank algorithm is a Markov random walk model. The HITS algorithm is another classic algorithm; it divides pages into authoritative pages and hub pages and sorts pages using the mutual reinforcement between the authority values and the hub values of pages.
In this paper, the framework, principles and a simple classification of search engines are first introduced, and related work on Web data mining, especially Web structure mining, is summarized. Then the PageRank and HITS algorithms are systematically analyzed in the context of Web structure mining, their advantages and disadvantages are discussed, and the main reason for their weaknesses is pointed out. Furthermore, two new algorithms are proposed that overcome different drawbacks of the PageRank algorithm and make use of the idea of the HITS algorithm. The first algorithm redefines the transfer function of the pages according to their in-degree, dead link rate and out-degree, and overcomes the topic-drift problem. The second algorithm considers the last-modified time of each page and adds inherited PageRank values to avoid discriminating against new pages.
Finally, a crawler program is designed according to the characteristics of the proposed algorithms, and simulations are run for the classic PageRank algorithm and the proposed algorithms. The results demonstrate the effectiveness and feasibility of the proposed algorithms.
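For orientation, the two classic algorithms summarized above can be sketched in a few lines of Python. This is only a minimal illustration of the textbook forms of PageRank (power iteration over a random walk) and HITS (mutual reinforcement of authority and hub values), not the modified algorithms proposed in the thesis. The function names, the damping factor of 0.85, the fixed iteration count, and the uniform handling of pages without out-links are all assumptions made for this sketch.

```python
def all_pages(links):
    """Collect every page that appears as a source or a target."""
    return sorted(set(links) | {q for targets in links.values() for q in targets})


def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults (assumptions).
    """
    pages = all_pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly
        # (one common convention, assumed here).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                # Each page passes its rank on in equal shares to its targets.
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


def hits(links, iterations=50):
    """Textbook HITS: returns (authority, hub) score dictionaries."""
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub values of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p in pages:
            for q in links.get(p, []):
                auth[q] += hub[p]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub value of a page: sum of the authority values of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub


# A small hypothetical link graph used only to exercise the functions.
example = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(example)
authority, hub_scores = hits(example)
```

In this toy graph, page C receives links from three pages, so both methods would be expected to score it highly; the thesis's proposed variants change how authority is transferred, which this sketch does not attempt to reproduce.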
Keywords/Search Tags: PageRank, Page Sorting Algorithm, In-degree, Dead link rate, Out-degree