With the explosive growth of information on the Web, the speed at which crawlers gather information can no longer meet the needs of practical applications. Given the massive number of web pages to be crawled, how effectively a crawling system obtains more high-quality pages directly determines overall system performance. The decentralized and dynamic nature of web information further complicates crawling: information sources may change at any time, so the crawling system must refresh its data to avoid stale or failed pages.

This paper builds an information crawling system based on Nutch (ICSBN) and discusses strategies for improving its performance and the quality of the pages it collects. The work focuses on the design of ICSBN, the improvement of system performance, and the assurance of web-page quality, and obtains the following results:

1. Design and implementation of an information crawling system based on Nutch
Drawing on the strengths of Hadoop and Nutch, this paper implements a scalable, distributed, parallel information crawling system. The MapReduce programming model is used to implement the five basic modules of the system: injector, URL selector, fetcher, parser, and updater. Experiments on a cluster measured system performance for different numbers of nodes. The results show that performance grows roughly linearly as nodes are added, demonstrating good scalability; the system also features low inter-node communication and balanced load. Compared with Nutch 1.0 in the same tests, the system achieves a higher crawl speed and better parallel speedup.
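The MapReduce structure of the five modules can be illustrated with a minimal sketch. This is not Nutch's actual API (the real system is Java on Hadoop); it is a simplified Python model of one module, the URL selector, where the map step groups candidate URLs by host and the reduce step keeps a per-host quota of the highest-scored URLs. All function names and the quota of 2 are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user map function, then group emitted values by key (the shuffle)."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reduce_fn):
    """Apply the user reduce function to each (key, values) group."""
    return [reduce_fn(key, values) for key, values in grouped.items()]

def select_map(record):
    # Emit (host, (score, url)) so all URLs of one host meet in one reducer.
    url, score = record
    host = url.split('/')[2]
    yield host, (score, url)

def select_reduce(host, scored):
    # Keep only the top-2 URLs per host (an assumed per-host politeness quota).
    top = sorted(scored, reverse=True)[:2]
    return host, [url for _, url in top]

crawl_db = [("http://a.com/1", 0.9), ("http://a.com/2", 0.4),
            ("http://a.com/3", 0.7), ("http://b.com/x", 0.8)]
selected = reduce_phase(map_phase(crawl_db, select_map), select_reduce)
```

Because map output is partitioned by host, each reducer sees all of one host's URLs, which is what lets a per-host quota be enforced without cross-node coordination.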
2. Improvement of system performance
To reduce the time spent on domain-name resolution and to lighten the load on DNS servers, this paper combines hash-based URL balancing, an optimized hashing strategy, and a multi-level cache to optimize the resolution process. An IP-filter plug-in was also implemented so that the fetching area of the system can be customized, improving system and network utilization.

3. Ensuring the quality of crawled web pages
This paper presents a crawling strategy based on page-rank priority: a URL evaluation scheme and a revised OPIC algorithm built on the observed change behavior of web pages and a study of page types. The strategy not only discovers new pages promptly but also re-fetches frequently changing pages in time, keeping the web-page database fresh. Moreover, under limited crawl capacity it retrieves the most important pages first, which is crawling performance in the true sense. To keep crawled pages up to date, a dynamic-selection prediction method chooses among different algorithms to predict when each page should be refreshed. Experiments based on these strategies show good results.
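The combination of hash-based URL balancing and a multi-level DNS cache described in result 2 can be sketched as follows. This is an illustrative Python model under stated assumptions, not the system's actual code: hosts are hash-partitioned so the same domain is always handled (and its DNS answer cached) by the same node, a per-process LRU cache serves as the first cache level, and a dictionary stands in for an assumed cluster-wide second level. `resolve_uncached` is a placeholder for a real DNS query.

```python
import hashlib
from functools import lru_cache

NUM_NODES = 4          # assumed cluster size

def node_for_host(host):
    """Hash-partition hosts across crawl nodes; a stable assignment means each
    domain's DNS answer is resolved and cached on exactly one node."""
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

shared_cache = {}      # level-2 cache (stand-in for a shared, cluster-wide store)

def resolve_uncached(host):
    # Placeholder for an actual DNS lookup (socket.gethostbyname in practice).
    return "93.184.216.34"

@lru_cache(maxsize=10_000)   # level-1 cache: fast, per-process
def resolve(host):
    if host in shared_cache:
        return shared_cache[host]
    ip = resolve_uncached(host)
    shared_cache[host] = ip
    return ip
```

Repeated lookups for the same host then hit the level-1 cache and never reach the DNS server, which is the stated goal of reducing resolution time and server load.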
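The page-rank-priority idea in result 3 builds on OPIC (On-line Page Importance Computation), whose core bookkeeping can be sketched briefly. This is a minimal Python model of standard OPIC, not the paper's revised variant: each URL holds "cash"; when a page is fetched, its cash moves into its history (accumulated importance) and is split evenly among its outlinks, and the frontier always fetches the URL with the most cash next. The class and method names are illustrative.

```python
class OpicScorer:
    """Minimal OPIC bookkeeping: cash flows from fetched pages to their
    outlinks, and accumulated history approximates page importance."""

    def __init__(self, seeds):
        share = 1.0 / len(seeds)
        self.cash = {url: share for url in seeds}      # spendable importance
        self.history = {url: 0.0 for url in seeds}     # accumulated importance

    def fetch(self, url, outlinks):
        cash = self.cash.pop(url, 0.0)
        self.history[url] = self.history.get(url, 0.0) + cash
        if outlinks:
            share = cash / len(outlinks)
            for link in outlinks:
                self.cash[link] = self.cash.get(link, 0.0) + share

    def next_url(self):
        # Greedy priority: fetch the URL holding the most cash.
        return max(self.cash, key=self.cash.get) if self.cash else None
```

Because newly discovered outlinks immediately receive cash, new pages enter the priority queue at once, which matches the abstract's claim of discovering new pages in time under limited capacity.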
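The dynamic-selection prediction of refresh times can likewise be sketched. This is a hypothetical Python illustration of the general idea only, since the abstract does not name the concrete algorithms: with too little change history the predictor falls back to a fixed default interval, and with enough samples it switches to the mean observed change interval. The threshold of 3 samples and the one-day default are assumptions.

```python
def predict_next_update(change_times, default_interval=86400.0):
    """Pick a predictor based on how much change history a page has.

    change_times: ascending timestamps (seconds) at which the page changed.
    Returns the predicted timestamp of the next change.
    """
    if len(change_times) < 3:
        # Too little history: fall back to a fixed default interval.
        last = change_times[-1] if change_times else 0.0
        return last + default_interval
    # Enough history: predict from the mean observed change interval.
    gaps = [b - a for a, b in zip(change_times, change_times[1:])]
    return change_times[-1] + sum(gaps) / len(gaps)
```

For a page observed to change at t = 0, 100, 200, 300, the mean gap is 100, so the next refresh is scheduled at t = 400; a page with one observation is simply revisited after the default interval.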