
Design And Implementation Of Focused Crawler

Posted on: 2014-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: X M Peng
Full Text: PDF
GTID: 2248330398470713
Subject: Computer Science and Technology
Abstract/Summary:
Over the last decade, the Internet has become the carrier of a vast amount of information. As high-speed networks have developed, how to use the Internet effectively and collect useful information quickly and accurately has become an active research topic. A general-purpose search engine is a simple search tool that can no longer meet the demand for personalized and specialized search. Topic-specific search engines have therefore become a trend in modern information retrieval: because they search only information related to a specific topic, they can offer users faster and more accurate retrieval. Search engines use web crawlers to fetch resources from the Internet automatically and to build an index over the fetched pages, which supports information retrieval for the user. The web crawler therefore plays an important role in how a search engine acquires network data.

This thesis first introduces the structure, working principle, and shortcomings of the general-purpose crawler. It then presents the structure and working principle of the topical crawler and analyzes in detail its core technologies, namely the topic search strategy and the topic relevance calculation.

Based on a careful analysis of the principles of topical crawlers, the thesis designs a topical crawler system with a parallel architecture. It identifies the disadvantages of the Shark-Search algorithm and the weaknesses of the HITS algorithm, and proposes a search strategy that combines the two. Existing topical crawlers need a large number of labeled samples for offline training; they cannot learn new information during crawling and cannot make full use of downloaded pages that are related to the topic, so they struggle to meet the demands of modern Web resource collection. Learning from newly downloaded pages online can speed up the topical crawler and improve its accuracy.
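The abstract does not say how the Shark-Search and HITS strategies are combined, so the following is only a minimal sketch of one plausible arrangement, written in Python for brevity rather than in the C++ of the thesis. It assumes a Shark-Search-style score (a decayed relevance inherited from the parent page, mixed with the relevance of the anchor text) blended linearly with a HITS authority score. The function names and the weights `decay`, `gamma` and `alpha` are hypothetical and do not come from the thesis.

```python
import math


def hits(graph, iterations=20):
    """Plain HITS on a dict {page: [pages it links to]}.

    Returns (authority, hub) dicts, each L2-normalised per iteration.
    """
    nodes = set(graph) | {dst for outs in graph.values() for dst in outs}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, outs in graph.items():
            for dst in outs:
                new_auth[dst] += hub[src]
        # Hub score of a page: sum of authority scores of the pages it links to.
        new_hub = {n: sum(new_auth[d] for d in graph.get(n, [])) for n in nodes}
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in new_auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return auth, hub


def inherited_score(parent_relevance, parent_inherited, decay=0.5):
    """Shark-Search-style inheritance: a relevant parent passes on a decayed
    share of its own relevance; an irrelevant parent passes on a decayed
    share of what it inherited itself."""
    if parent_relevance > 0:
        return decay * parent_relevance
    return decay * parent_inherited


def link_priority(inherited, anchor_relevance, authority, gamma=0.5, alpha=0.7):
    """Crawl priority of an unvisited link.

    The content part follows the Shark-Search form (inherited score mixed
    with anchor-text relevance). Blending it linearly with the HITS
    authority score is an assumption made for this sketch.
    """
    content = gamma * inherited + (1.0 - gamma) * anchor_relevance
    return alpha * content + (1.0 - alpha) * authority
```

In a crawler built this way, links would be popped from the frontier in descending `link_priority` order, so that a link with weak anchor text can still be visited early when it points to a page that many hub pages cite.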
The thesis studies the incremental Bayes classifier algorithm and applies it to the topic relevance calculation of the topical crawler.

Finally, the topical crawler is implemented in C++ on Linux. The experimental results show that the crawler system performs well and can collect information automatically and accurately.
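To illustrate what incremental learning means for relevance scoring, here is a minimal sketch of a multinomial naive Bayes classifier whose counts can be updated one page at a time, so newly downloaded pages refine the model without retraining from scratch. It is written in Python for brevity and is not the thesis's C++ implementation. The class name, the two labels, the whitespace tokenisation and the Laplace smoothing constant are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Multinomial naive Bayes that learns one document at a time."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing               # Laplace smoothing constant
        self.doc_counts = Counter()              # documents seen per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.total_words = Counter()             # total tokens per label
        self.vocabulary = set()

    def update(self, tokens, label):
        """Fold one labelled document into the counts (the incremental step)."""
        self.doc_counts[label] += 1
        for token in tokens:
            self.word_counts[label][token] += 1
            self.total_words[label] += 1
            self.vocabulary.add(token)

    def log_posteriors(self, tokens):
        """Unnormalised log P(label | tokens) for every label seen so far."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = max(len(self.vocabulary), 1)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            log_p = math.log(n_docs / total_docs)
            denom = self.total_words[label] + self.smoothing * vocab_size
            for token in tokens:
                count = self.word_counts[label][token]
                log_p += math.log((count + self.smoothing) / denom)
            scores[label] = log_p
        return scores

    def relevance(self, tokens, positive="relevant"):
        """Posterior probability that the document belongs to `positive`."""
        scores = self.log_posteriors(tokens)
        if positive not in scores:
            return 0.0
        top = max(scores.values())
        # Subtract the maximum before exponentiating to avoid underflow.
        exp = {label: math.exp(s - top) for label, s in scores.items()}
        return exp[positive] / sum(exp.values())
```

A crawler could call `relevance` on each fetched page and, when the score is confidently high or low, feed the page back through `update`. That feedback rule is a hypothetical usage pattern, not a procedure described in the abstract.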
Keywords/Search Tags: Web crawler, Topic Search, Crawling Strategies, Incremental Bayesian, Parallel Architecture