
Distributed Web Crawler System

Posted on: 2011-12-11
Degree: Master
Type: Thesis
Country: China
Candidate: W Hu
Full Text: PDF
GTID: 2208360302970196
Subject: Computer application technology
Abstract/Summary:
With the explosive growth of the Internet, the Web has become a huge worldwide network of information services. According to CNNIC statistics, by the end of 2008 the number of Chinese web pages alone exceeded 16 billion, an increase of about 90% over 2007, and the number of websites grew at essentially the same rate. Faced with such a huge body of information, how can the information we need be retrieved quickly and accurately? The search engine has become one of the most important means of accessing information on the Web. The number of indexed pages and their quality are important indicators of a search engine, and the Web crawler, as a primary component of the search engine, is therefore an important foundation of any good search engine. At present, for reasons of commercial confidentiality, the crawler technology of the various search engines is generally not disclosed, and the available literature offers only summary introductions.

The purpose of this thesis is to study, design, and implement a distributed Web crawler system. Starting from an analysis of the overall composition of a search engine, the thesis narrows its focus to the Web crawler. The basic principles of building a crawler are illustrated with a small prototype crawler, and the crawler's core mechanisms are then analyzed in depth, covering the system's crawling strategies, re-visit strategies, politeness issues, and so on. On this basis, the thesis designs a practical architecture for a distributed Web crawler, proposes a distributed co-crawling algorithm to solve the problems of distributed crawling, and proposes an improved large-scale web page storage structure that can satisfy both massive random access and the addition of massive numbers of pages. Finally, a distributed Web crawler system is designed and implemented, and a vision for its future development is given.

The specific work of this thesis is as follows:
(1) Analyzed the crawling strategies of the system, including the crawl-priority strategy and the strategy for avoiding repeated crawling, with particular attention to page re-visit strategies and crawler politeness (a minimal frontier sketch follows this list).
(2) Designed a practical distributed Web crawler architecture that pursues load balancing while minimizing communication and administrative overhead (see the RMI sketch below).
(3) Proposed a distributed co-crawling algorithm and, following the RMI distributed-system development process, applied it to solve the problems of distributed crawling.
(4) Proposed an improved large-scale web page storage structure that adapts to the different needs of sequential access and random access (see the storage sketch below).
(5) Designed and implemented a distributed Web crawler system and analyzed its running results in terms of performance, scalability, and load balancing, with very satisfactory results.
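The abstract describes these strategies only at a high level. As an illustration only, the short Java sketch below shows one common way to combine two of the ideas from item (1): a URL frontier that skips already-seen URLs and enforces a per-host politeness delay before handing a URL to a fetcher. The class and member names (SimpleFrontier, add, next) and the fixed delay are assumptions made for this example, not details taken from the thesis.

    import java.net.URI;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Minimal URL frontier: avoids repeated crawling of the same URL and
    // is polite by spacing out requests to the same host.
    public class SimpleFrontier {
        private final Deque<String> queue = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();            // duplicate-URL filter
        private final Map<String, Long> lastFetch = new HashMap<>(); // host -> last fetch time (ms)
        private final long politenessMillis;

        public SimpleFrontier(long politenessMillis) {
            this.politenessMillis = politenessMillis;
        }

        // Schedule a URL only if it has never been seen before.
        public synchronized void add(String url) {
            if (seen.add(url)) {
                queue.addLast(url);
            }
        }

        // Return the next URL whose host respects the politeness delay, or null if none is ready.
        public synchronized String next() {
            long now = System.currentTimeMillis();
            for (int i = 0; i < queue.size(); i++) {
                String url = queue.pollFirst();
                String host = URI.create(url).getHost();
                Long last = lastFetch.get(host);
                if (last == null || now - last >= politenessMillis) {
                    lastFetch.put(host, now);  // record the fetch time for this host
                    return url;
                }
                queue.addLast(url);            // too soon for this host, try another URL
            }
            return null;
        }
    }

A production crawler would keep the queue and the seen set on disk and use a crawl-priority order instead of FIFO, but this in-memory version is enough to show the avoid-repeat and politeness strategies working together.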
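Items (2) and (3) name Java RMI and load balancing but give no interface details. The sketch below is a hypothetical illustration of how cooperating crawler nodes could exchange URLs over RMI, with each host assigned to exactly one node by a hash so that no page is fetched twice and cross-node traffic stays low; the names CrawlerNode, submitUrls, and UrlPartitioner are invented for the example and are not taken from the thesis.

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.util.List;

    // CrawlerNode.java: remote interface a crawler node publishes in the RMI
    // registry so that peers can forward URLs belonging to its partition.
    public interface CrawlerNode extends Remote {
        void submitUrls(List<String> urls) throws RemoteException;
    }

    // UrlPartitioner.java: decide which node is responsible for a host.
    // Keeping each host on a single node balances work and avoids duplicate fetches.
    public final class UrlPartitioner {
        private UrlPartitioner() {
        }

        public static int nodeFor(String host, int nodeCount) {
            return Math.floorMod(host.hashCode(), nodeCount);
        }
    }

When a node extracts a link whose host hashes to another node, it would look up that node's stub in the RMI registry and call submitUrls; URLs that hash to the node itself go straight into its local frontier.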
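Item (4) states only the required properties of the page storage structure. The Java sketch below illustrates one simple structure with those properties: pages are appended sequentially to a single data file, which is cheap when massive numbers of pages are added, while an index from URL to file offset allows random access to any single page. The class name PageStore, the record layout, and the in-memory index are assumptions for this example; the thesis's actual structure may differ.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    // Append-only page store with an offset index: sequential writes for bulk
    // additions, random reads for single-page lookups.
    public class PageStore implements AutoCloseable {
        private final RandomAccessFile data;
        private final Map<String, Long> index = new HashMap<>(); // URL -> record offset

        public PageStore(String path) throws IOException {
            this.data = new RandomAccessFile(path, "rw");
        }

        // Append a record ([length][bytes]) at the end of the file and remember its offset.
        public synchronized void put(String url, String html) throws IOException {
            long offset = data.length();
            data.seek(offset);
            byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
            data.writeInt(bytes.length);
            data.write(bytes);
            index.put(url, offset);
        }

        // Random access: jump straight to the recorded offset and read one page.
        public synchronized String get(String url) throws IOException {
            Long offset = index.get(url);
            if (offset == null) {
                return null;
            }
            data.seek(offset);
            byte[] bytes = new byte[data.readInt()];
            data.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }

        @Override
        public void close() throws IOException {
            data.close();
        }
    }

Scanning the data file from the beginning recovers sequential access and, if needed, the index itself; a real system would also persist the index and split the data file once it grows too large.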
Keywords/Search Tags: search engine, web crawler, crawling strategy, distributed systems, page base