Design And Implementation Of Focused Crawler

Posted on:2014-01-28

Degree:Master

Type:Thesis

Country:China

Candidate:X M Peng

GTID:2248330398470713

Subject:Computer Science and Technology

Abstract/Summary:

Over the last decade the Internet has become the carrier of a vast amount of information. With the development of high-speed networks, how to use the Internet effectively and collect useful information quickly and accurately has become an active research topic. A general-purpose search engine is only a simple search tool and can no longer satisfy personalized and specialized search needs. Topic-specific search engines have therefore become a trend in modern information retrieval: because they search only information related to a specific topic, they can offer users faster and more accurate retrieval. Search engines use web crawlers to fetch network resources automatically and build an index over the downloaded pages, which supports the user's information retrieval. The web crawler therefore plays an important role in how a search engine acquires network data.

This thesis first introduces the structure, working principle, and shortcomings of the general crawler. It then presents the structure and working principle of the topical crawler and analyzes its core technologies, namely the topic search strategy and the topic relevance calculation.

Based on a careful analysis of topical crawler principles, the thesis designs a topical crawler system with a parallel architecture. It points out the disadvantages of the Shark-Search algorithm and the weaknesses of the HITS algorithm, and proposes a search strategy that combines the two. Existing topical crawlers need a large number of labeled samples for offline training; they cannot learn new information during crawling and cannot make full use of downloaded pages that are related to the topic, so they struggle to meet the demands of modern Web resource collection. Learning from newly downloaded pages online can speed up the topical crawler and improve its accuracy. The thesis therefore studies the incremental Bayes classifier algorithm and applies it to the topic relevance calculation of the topical crawler.

Finally, the topical crawler is implemented in C++ on Linux. The experimental results show that the crawler system performs well and can collect information automatically and accurately.
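
To make the idea of online learning concrete, the following is a minimal sketch of an incremental naive Bayes relevance classifier. It is not the thesis's implementation: the class name IncrementalBayes, the two-class (relevant / irrelevant) setup, the multinomial model with Laplace smoothing, and the pre-tokenized input are all assumptions made for illustration. The point it shows is that each newly downloaded page only increments a few counters, so the model can be updated during the crawl without retraining on the full sample set.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical sketch: two-class multinomial naive Bayes whose counts are
// updated one page at a time (names and model choice are assumptions).
class IncrementalBayes {
public:
    // Fold one labeled page into the model by incrementing counters only.
    void update(const std::vector<std::string>& tokens, bool relevant) {
        Counts& c = relevant ? rel_ : irr_;
        ++c.docs;
        for (const auto& t : tokens) {
            ++c.term[t];
            ++c.total;
            vocab_.insert(t);
        }
    }

    // Log-odds of "relevant" against "irrelevant"; positive favors relevant.
    double logOdds(const std::vector<std::string>& tokens) const {
        // Both classes need at least one example for the prior to be defined.
        assert(rel_.docs > 0 && irr_.docs > 0);
        double score = std::log(static_cast<double>(rel_.docs)) -
                       std::log(static_cast<double>(irr_.docs));
        for (const auto& t : tokens) {
            score += logLikelihood(rel_, t) - logLikelihood(irr_, t);
        }
        return score;
    }

    bool isRelevant(const std::vector<std::string>& tokens) const {
        return logOdds(tokens) > 0.0;
    }

private:
    struct Counts {
        std::unordered_map<std::string, long> term;  // per-term occurrences
        long total = 0;                              // all term occurrences
        long docs = 0;                               // pages seen in class
    };

    // Laplace-smoothed log P(term | class).
    double logLikelihood(const Counts& c, const std::string& t) const {
        auto it = c.term.find(t);
        const long n = (it == c.term.end()) ? 0 : it->second;
        return std::log((n + 1.0) /
                        (static_cast<double>(c.total) + vocab_.size()));
    }

    Counts rel_;
    Counts irr_;
    std::unordered_set<std::string> vocab_;  // terms seen in either class
};
```

In a crawler, update would be called on each downloaded page once its label is known or estimated with enough confidence, and logOdds could serve as the topic relevance score used to rank the page's outgoing links. How labels are obtained during crawling is left open here, since the abstract does not specify it.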

Keywords/Search Tags:

Web crawler, Topic Search, Crawling Strategies, Incremental Bayesian, Parallel Architecture

Related items

1	Design Of A Parallel Web Crawling System
2	Crawling Search Strategy Subject-oriented Research And Realized
3	The Design And Research Of Topic Web Crawler In Vertical Search Engine
4	The Study On Incremental Crawling Of Web Forums
5	Research And Implementation Of Domain-Specific Topic Search System
6	Research On Efficient Web Information Crawling Strategy
7	Research On The Key Technology And Implementation Of The Focused Crawler Based On HITS And Shark-Search
8	Research On Topic Search And Its Key Algorithm
9	Research On Web Crawling Strategies
10	Design And Implementation Of Topic-specific Web Crawler