Research On Topical Crawler Combining Web Page Content And Hyperlink

Posted on: 2011-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: L B Luo
GTID: 2178360305991779
Subject: Computer application technology
Abstract/Summary:
As the amount of information on the Internet grows, the limitations of traditional search engines, such as low coverage, poor timeliness, and inaccurate results, have become increasingly obvious. In response, vertical search engines have appeared, which can return more satisfactory results than traditional search engines within a specific domain. The topical crawler is the core component of a vertical search engine. Research on topical crawlers is therefore important for improving network bandwidth utilization, saving hardware resources, and improving search efficiency.

This thesis first introduces the basic principles of crawlers and then discusses the key technologies of topical crawlers, such as Chinese word segmentation, topic relevance judgment, and the construction of the topic vector, with the main focus on crawling strategies. It describes representative algorithms for each type of crawling strategy, analyzes their advantages and disadvantages, and proposes improved algorithms.

The thesis improves the term weight calculation of the traditional Vector Space Model by assigning different weights to words according to their location in the page. It also improves the handling of unreasonable links in the HITS algorithm and strengthens the relations among them: when extending the root set, if a website B has n web pages that point to web page A on another website, the weight of each of those links is set to 1/n, while the weight of all other links remains 1. Because the Shark-Search algorithm tends to be "short-sighted" and the HITS algorithm tends toward "topic drift", the thesis combines the advantages of the content-based Shark-Search algorithm and the link-based HITS algorithm to form two new topical crawler algorithms, S-Hits and MT-Hits, and implements them. Experiments show that the new algorithms are effective to a certain degree.
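The 1/n link-weighting rule from the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the function names (`link_weights`, `weighted_hits`, `normalize`), the use of the URL host name to decide which website a page belongs to, the fixed iteration count, and the Euclidean normalization are all assumptions made for illustration. The sketch covers only the weighted HITS iteration, not the S-Hits or MT-Hits algorithms, whose details the abstract does not give.

```python
from collections import defaultdict
from math import sqrt
from urllib.parse import urlparse


def link_weights(links):
    """Map each (source, target) link to a weight.

    Assumed reading of the abstract: if n pages of one site point to the
    same page on another site, each of those links gets weight 1/n;
    every other link keeps weight 1.
    """
    links = set(links)
    # For each (source site, target page), collect the source pages.
    sources_per_site = defaultdict(set)
    for src, dst in links:
        sources_per_site[(urlparse(src).netloc, dst)].add(src)

    weights = {}
    for src, dst in links:
        src_site = urlparse(src).netloc
        if src_site != urlparse(dst).netloc:
            n = len(sources_per_site[(src_site, dst)])
            weights[(src, dst)] = 1.0 / n
        else:
            weights[(src, dst)] = 1.0
    return weights


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def weighted_hits(links, iterations=50):
    """Run HITS with the link weights above; return (authority, hub)."""
    weights = link_weights(links)
    pages = {page for link in weights for page in link}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authority: weighted sum of hub scores of pages linking in.
        new_auth = dict.fromkeys(pages, 0.0)
        for (src, dst), w in weights.items():
            new_auth[dst] += w * hub[src]
        # Hub: weighted sum of authority scores of pages linked to.
        new_hub = dict.fromkeys(pages, 0.0)
        for (src, dst), w in weights.items():
            new_hub[src] += w * new_auth[dst]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub
```

With this weighting, several pages of one site that all link to the same external page contribute together as much as a single link would, which limits the influence one site can have on another page's authority score.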
Keywords/Search Tags: Vertical search engine, Topical Crawler, Crawling Strategy