Study Of HITS Algorithm In Web Hyperlink Analysis

Posted on: 2007-01-12
Degree: Master
Type: Thesis
Country: China
Candidate: F F Liu
Full Text: PDF
GTID: 2178360212457157
Subject: Software engineering
Abstract/Summary:
The Web is an enormous information repository that provides many kinds of information services. With the spread of the Web and the rapid growth of Web information, acquiring the information we need from it has become increasingly important. Discovering valuable information in the distributed Web environment and extracting knowledge from it has therefore become an important task in information research and data mining. Users want not only relevant Web pages but also high-quality ones, that is, authoritative pages. A page's hyperlinks are an important signal for identifying such pages, and hyperlink analysis offers a new approach to these problems. HITS is a widely used algorithm for distilling authoritative sources based on hyperlink analysis, and it is well worth studying. This thesis briefly introduces Web hyperlink analysis, analyzes the strengths and weaknesses of HITS on that basis, and then compares HITS with the classical PageRank algorithm. Through a study of topic drift, it proposes an improved algorithm that addresses the topic drift problem. HITS ranks documents using only the in-links and out-links of pages, which can cause problems in some cases: the initially collected pages, called the root set, must be expanded into a base set, and this expansion adds many pages that are irrelevant to the topic. If those pages are densely interlinked, topic drift occurs, and the authoritative pages returned by HITS are not what the user expects. To address this, the thesis analyzes the topic distillation problem in depth and proposes an improved algorithm that enhances HITS using a maximum flow algorithm.
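As an illustration of the original HITS iteration discussed above, the following is a minimal sketch, not the thesis's implementation. It assumes the base set is already available as a list of directed (source, target) links; the function names `hits` and `normalize` and the fixed iteration count are hypothetical choices made for this example.

```python
from math import sqrt


def normalize(scores):
    """Scale a score dict to unit Euclidean length (no-op if all zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Return (authority, hub) score dicts for a directed link list.

    edges: iterable of (source_page, target_page) pairs in the base set.
    """
    edges = list(edges)
    nodes = {page for edge in edges for page in edge}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum the new authority scores of pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub
```

Because the scores depend only on link structure, a densely interlinked group of off-topic pages in the base set can end up with the highest authority scores, which is the topic drift behaviour described above.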
Because the improved algorithm draws on the maximum flow problem, this thesis also discusses a community discovery method based on maximum flow and the symbol algorithm. Finally, it compares the improved HITS algorithm with the original HITS algorithm experimentally. Data are obtained with a web crawler; the original HITS and MCHITS algorithms are then run, and the symbol algorithm is implemented. The experimental results show that the improved algorithm markedly improves the returned results and reduces the probability of topic drift.
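The general idea behind maximum-flow community discovery can be sketched as follows. This is a generic illustration and not the thesis's MCHITS or symbol algorithm, whose details the abstract does not give. It assumes links are supplied as (source, target, capacity) triples, and it treats the pages still reachable from a seed page in the residual graph after a maximum flow computation (the source side of a minimum cut) as the community. The function name `source_side_of_min_cut` is hypothetical, and the max flow routine is a plain Edmonds-Karp implementation.

```python
from collections import defaultdict, deque


def source_side_of_min_cut(edges, source, sink):
    """Run Edmonds-Karp and return (max_flow, pages on the source side).

    edges: iterable of (u, v, capacity) directed links.
    """
    cap = defaultdict(lambda: defaultdict(int))
    for u, v, c in edges:
        cap[u][v] += c
        cap[v][u] += 0  # make sure the reverse residual entry exists

    def reachable():
        # Breadth-first search over links with remaining capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable()
        if sink not in parent:
            # No augmenting path left: reachable pages form the source side.
            return flow, set(parent)
        # Find the bottleneck capacity along the path found by BFS.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        # Push flow along the path and update residual capacities.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck
```

In this sketch, pages that are tightly linked to the seed stay on its side of the cut, while pages connected only through a few weak links are separated from it.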
Keywords/Search Tags:Hyperlink analysis, HITS, Maximal flow