
Distributed Web Crawler System

Posted on: 2011-12-11
Degree: Master
Type: Thesis
Country: China
Candidate: W Hu
Full Text: PDF
GTID: 2208360302970196
Subject: Computer application technology
Abstract/Summary:
With the explosive growth of the Internet, the Web has become a huge worldwide network of information services. According to CNNIC statistics, by the end of 2008 the number of Chinese web pages alone exceeded 16 billion, an increase of about 90% over 2007, and the number of websites grew at essentially the same rate. Faced with such a huge body of information, how can the information we need be retrieved quickly and accurately? The search engine has become one of the most important means of accessing information on the Web. The number of indexed pages and their quality are important indicators of a search engine, and the Web crawler, as a primary component of the search engine, is therefore an important foundation of any good search engine. At present, for reasons of commercial confidentiality, the crawler technology of the various search engines is generally not disclosed, and the available literature offers only summary introductions.

The purpose of this thesis is to study, design, and implement a distributed Web crawler system. Starting from an analysis of the overall composition of a search engine, the thesis narrows its focus to the Web crawler. The basic principles of building a crawler are illustrated with a small prototype crawler, and the crawler's core mechanisms are then analyzed in depth, covering the system's crawling strategies, re-visit strategies, politeness issues, and so on. On this basis, the thesis designs a practical architecture for a distributed Web crawler, proposes a distributed co-crawling algorithm to solve the problems of distributed crawling, and proposes an improved large-scale web page storage structure that can satisfy both massive random access and the addition of massive numbers of pages. Finally, a distributed Web crawler system is designed and implemented, and a vision for its future development is given.

The specific work of this thesis is as follows:
(1) Analyzed the crawling strategies of the system, including the crawl-priority strategy and the strategy for avoiding repeated crawling, with particular attention to page re-visit strategies and crawler politeness (a minimal frontier sketch follows this list).
(2) Designed a practical distributed Web crawler architecture that pursues load balancing while minimizing communication and administrative overhead (see the RMI sketch below).
(3) Proposed a distributed co-crawling algorithm and, following the RMI distributed-system development process, applied it to solve the problems of distributed crawling.
(4) Proposed an improved large-scale web page storage structure that adapts to the different needs of sequential access and random access (see the storage sketch below).
(5) Designed and implemented a distributed Web crawler system and analyzed its running results in terms of performance, scalability, and load balancing, with very satisfactory results.
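The abstract describes these strategies only at a high level. As an illustration only, the short Java sketch below shows one common way to combine two of the ideas from item (1): a URL frontier that skips already-seen URLs and enforces a per-host politeness delay before handing a URL to a fetcher. The class and member names (SimpleFrontier, add, next) and the fixed delay are assumptions made for this example, not details taken from the thesis.

    import java.net.URI;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Minimal URL frontier: avoids repeated crawling of the same URL and
    // is polite by spacing out requests to the same host.
    public class SimpleFrontier {
        private final Deque<String> queue = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();            // duplicate-URL filter
        private final Map<String, Long> lastFetch = new HashMap<>(); // host -> last fetch time (ms)
        private final long politenessMillis;

        public SimpleFrontier(long politenessMillis) {
            this.politenessMillis = politenessMillis;
        }

        // Schedule a URL only if it has never been seen before.
        public synchronized void add(String url) {
            if (seen.add(url)) {
                queue.addLast(url);
            }
        }

        // Return the next URL whose host respects the politeness delay, or null if none is ready.
        public synchronized String next() {
            long now = System.currentTimeMillis();
            for (int i = 0; i < queue.size(); i++) {
                String url = queue.pollFirst();
                String host = URI.create(url).getHost();
                Long last = lastFetch.get(host);
                if (last == null || now - last >= politenessMillis) {
                    lastFetch.put(host, now);  // record the fetch time for this host
                    return url;
                }
                queue.addLast(url);            // too soon for this host, try another URL
            }
            return null;
        }
    }

A production crawler would keep the queue and the seen set on disk and use a crawl-priority order instead of FIFO, but this in-memory version is enough to show the avoid-repeat and politeness strategies working together.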
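Items (2) and (3) name Java RMI and load balancing but give no interface details. The sketch below is a hypothetical illustration of how cooperating crawler nodes could exchange URLs over RMI, with each host assigned to exactly one node by a hash so that no page is fetched twice and cross-node traffic stays low; the names CrawlerNode, submitUrls, and UrlPartitioner are invented for the example and are not taken from the thesis.

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.util.List;

    // CrawlerNode.java: remote interface a crawler node publishes in the RMI
    // registry so that peers can forward URLs belonging to its partition.
    public interface CrawlerNode extends Remote {
        void submitUrls(List<String> urls) throws RemoteException;
    }

    // UrlPartitioner.java: decide which node is responsible for a host.
    // Keeping each host on a single node balances work and avoids duplicate fetches.
    public final class UrlPartitioner {
        private UrlPartitioner() {
        }

        public static int nodeFor(String host, int nodeCount) {
            return Math.floorMod(host.hashCode(), nodeCount);
        }
    }

When a node extracts a link whose host hashes to another node, it would look up that node's stub in the RMI registry and call submitUrls; URLs that hash to the node itself go straight into its local frontier.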
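Item (4) states only the required properties of the page storage structure. The Java sketch below illustrates one simple structure with those properties: pages are appended sequentially to a single data file, which is cheap when massive numbers of pages are added, while an index from URL to file offset allows random access to any single page. The class name PageStore, the record layout, and the in-memory index are assumptions for this example; the thesis's actual structure may differ.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    // Append-only page store with an offset index: sequential writes for bulk
    // additions, random reads for single-page lookups.
    public class PageStore implements AutoCloseable {
        private final RandomAccessFile data;
        private final Map<String, Long> index = new HashMap<>(); // URL -> record offset

        public PageStore(String path) throws IOException {
            this.data = new RandomAccessFile(path, "rw");
        }

        // Append a record ([length][bytes]) at the end of the file and remember its offset.
        public synchronized void put(String url, String html) throws IOException {
            long offset = data.length();
            data.seek(offset);
            byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
            data.writeInt(bytes.length);
            data.write(bytes);
            index.put(url, offset);
        }

        // Random access: jump straight to the recorded offset and read one page.
        public synchronized String get(String url) throws IOException {
            Long offset = index.get(url);
            if (offset == null) {
                return null;
            }
            data.seek(offset);
            byte[] bytes = new byte[data.readInt()];
            data.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }

        @Override
        public void close() throws IOException {
            data.close();
        }
    }

Scanning the data file from the beginning recovers sequential access and, if needed, the index itself; a real system would also persist the index and split the data file once it grows too large.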
Keywords/Search Tags: search engine, web crawler, crawling strategy, distributed systems, page base