Research On Web Page Sorting Algorithm Of Web Structure Mining

Posted on:2012-10-04

Degree:Master

Type:Thesis

Country:China

Candidate:C Yang

Full Text:PDF

GTID:2248330395455287

Subject:Computer software and theory

Abstract/Summary:

In recent years, with the rapid development of the Internet, the amount of information available online has grown exponentially, which has greatly affected how people live and study. It is generally very difficult for Internet users to find exactly the information they need without specific tools. Search engine technology emerged against this background. Web page sorting algorithms, which rank pages by authority values derived from the links between them, are a key component of search engines. As is well known, Google's PageRank algorithm is a classic and efficient Web page sorting algorithm; its main idea is to transfer authority values along the links between pages. In terms of its mathematical model, PageRank is a Markov random walk. HITS is another classic algorithm: it divides pages into authority pages and hub pages, and sorts them using the mutual reinforcement between the authority values and the hub values of pages.

In this paper, the framework, principles and a simple classification of search engines are first introduced, and related work on Web data mining, especially Web structure mining, is summarized. Then the PageRank and HITS algorithms are systematically analyzed in the context of Web structure mining, their advantages and disadvantages are discussed, and the main reason for their weaknesses is pointed out. Furthermore, two new algorithms are proposed that overcome different drawbacks of PageRank and make use of the idea of HITS. The first algorithm redefines the transfer function of the pages according to their in-degree, dead link rate and out-degree, and overcomes the topic-drift problem. The second algorithm takes the page's last modified time into account and adds inherited PageRank values to avoid discriminating against new pages. Finally, a crawler program is designed according to the characteristics of the proposed algorithms, and simulations are carried out for the classic PageRank algorithm and the proposed algorithms. The results demonstrate the effectiveness and feasibility of the proposed algorithms.
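
For readers unfamiliar with the two classic algorithms the abstract analyzes, the following is a minimal Python sketch of textbook PageRank (as a random walk with damping) and textbook HITS (mutual reinforcement of hub and authority values). It is not the thesis's proposed algorithms, whose transfer functions are not given in the abstract. The damping factor 0.85, the iteration count, the even redistribution of rank from pages without out-links, and the four-page graph `toy_graph` are all assumptions chosen for illustration.

```python
def all_pages(links):
    """Collect every page that appears as a source or a target."""
    return sorted(set(links) | {q for targets in links.values() for q in targets})


def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration on {page: [linked pages]}."""
    pages = all_pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: rank held by pages with no out-links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                # Each page passes an equal share of its rank along each out-link.
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


def hits(links, iterations=50):
    """Textbook HITS: returns (authority, hub) score dictionaries."""
    pages = all_pages(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for p in pages:
            for q in links.get(p, []):
                new_auth[q] += hub[p]
        # Hub score of a page: sum of authority scores of the pages it links to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub


# Hypothetical four-page graph: A links to B and C, B to C, C to A, D to C.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

if __name__ == "__main__":
    print(pagerank(toy_graph))
    print(hits(toy_graph))
```

In this toy graph, page C collects links from A, B and D, so both methods rank it first by authority, while A, which links to both B and C, receives the highest hub score under HITS.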

Keywords/Search Tags:

PageRank, Page Sorting Algorithm, In-degree, Dead link rate, Out-degree

Related items

1	Web Page Sorting Algorithms Based On The Analysis Of The Linking Structure
2	Search Engine Sorting Algorithms Based On The Relation Degree Of The Word
3	A Study Of Page Sorting Algorithm Based On User’s Habit
4	Study On Web Information Credibility Evaluation Method Based On Improved PageRank
5	Page Ranking Algorithm Based On Link Similarity Study
6	Design And Implementation Of The Focused Crawler System Based On Customized Domain Conceptions
7	The Research Of Improvement In Link-based PageRank Sorting Algorithm
8	The Research Of Improvement In Link-based Pagerank Sorting Algorithm
9	Collaborative Filtering Algorithm Research Based On Page Interest Degree
10	Research For Rough Set Models Under Similarity Connection Degree Tolerance Relation