| With the development of the Internet, websites and the data they hold have become ever larger and more complex. We need information from the Internet more than ever before and often depend on search engines to find it. As the data source of a search engine, the web crawler plays an important role. Crawler characteristics such as crawling speed, coverage, page ranking, indexing, and timeliness directly affect search results. Meanwhile, demand for deeply integrated information is widespread. Many companies, organizations, and individuals continue to research and develop new crawlers, especially topic-focused (theme) crawlers. In an enterprise, the information gathered by a web crawler can serve as a data source for a data warehouse, where it can be presented multidimensionally, and as input for data mining. For example, a public opinion monitoring system needs to collect relevant information from the Internet, and real estate businesses use crawlers to gather property information to support analysis and decision-making. Some people also use crawlers to mine information and gather intelligence from the Internet. However, a traditional crawler running on a single computer struggles to cope with the rapid growth of information and cannot fetch massive amounts of data quickly and effectively. Distributed technology supports large clusters and massive shared storage. It uses the CPU of every node to increase total computing power, and it provides greater total bandwidth. It thereby addresses the crawler's efficiency problem at its root and reduces IT operating costs, because it relies on inexpensive personal computers instead of expensive servers. This paper analyzes the crawler's principles, workflow, crawling strategies, web page analysis methods, and other related theory, based on the structure of websites and the principles of web pages. 
To improve crawl efficiency, the web crawler is optimized using the distributed cluster features of Hadoop. A configurable, high-performance, load-balanced, and scalable distributed web crawler prototype system based on Hadoop is designed and implemented. The paper describes and analyzes the system's architecture, its implementation approach, and the design and implementation of several key modules in the context of distributed cluster technology. It also gives solutions to several key technical issues: the design of the URL queue, the removal of duplicates among massive numbers of URLs, multi-threaded parallel crawling, the incremental updating of web pages, and the parsing of dynamic web pages. Finally, the crawling performance is tested and analyzed. |
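
The abstract names the URL queue and duplicate-URL removal as key issues but does not say how they are solved. As a rough illustration only, the following single-process Python sketch shows one common way to combine the two: URLs are normalized, and a set of fixed-size hash fingerprints decides whether a URL has been seen before it enters a FIFO queue. The names `normalize` and `UrlFrontier` are hypothetical, the normalization rules are deliberately minimal, and nothing here is taken from the paper's Hadoop-based design. In a cluster, the seen-set would have to be partitioned or shared across nodes, which this sketch does not attempt.

```python
import hashlib
from collections import deque
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Canonicalize a URL so trivially different spellings compare equal.

    Assumed rules (illustrative, not from the paper): drop the fragment,
    lowercase scheme and host, and treat an empty path as "/".
    """
    url, _fragment = urldefrag(url.strip())
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class UrlFrontier:
    """FIFO queue of URLs to crawl that rejects already-seen URLs."""

    def __init__(self) -> None:
        self._queue = deque()
        # Store 32-byte fingerprints rather than full URL strings.
        self._seen = set()

    def push(self, url: str) -> bool:
        """Enqueue url if it is new; return True when it was added."""
        canonical = normalize(url)
        fingerprint = hashlib.sha256(canonical.encode("utf-8")).digest()
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        self._queue.append(canonical)
        return True

    def pop(self) -> str:
        """Return the oldest queued URL."""
        return self._queue.popleft()

    def __len__(self) -> int:
        return len(self._queue)
```

With this sketch, pushing `HTTP://Example.com/a#top` and then `http://example.com/a` enqueues only one entry, because both normalize to the same string. For very large URL sets, a probabilistic structure such as a Bloom filter is often substituted for the exact set to save memory, at the cost of occasionally skipping a URL that was never crawled.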