With the explosive growth of information on the Web, the speed at which crawlers gather information can no longer meet the needs of practical applications. Given the massive number of web pages to be crawled, how effectively a crawling system obtains more high-quality pages directly determines overall system performance. The decentralized and dynamic nature of web information further complicates crawling: information sources may change at any time, so the crawling system must refresh its data to avoid stale or failed pages.

This paper builds an information crawling system based on Nutch (ICSBN) and discusses strategies for improving its performance and the quality of the pages it collects. The work focuses on the design of ICSBN, the improvement of system performance, and the assurance of web-page quality, and obtains the following results:

1. Design and implementation of an information crawling system based on Nutch
Drawing on the strengths of Hadoop and Nutch, this paper implements a scalable, distributed, parallel information crawling system. The MapReduce programming model is used to implement the five basic modules of the system: injector, URL selector, fetcher, parser, and updater. Experiments on a cluster measured system performance for different numbers of nodes. The results show that performance grows roughly linearly as nodes are added, demonstrating good scalability; the system also features low inter-node communication and balanced load. Compared with Nutch 1.0 in the same tests, the system achieves a higher crawl speed and better parallel speedup.
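The MapReduce structure of the five modules can be illustrated with a minimal sketch. This is not Nutch's actual API (the real system is Java on Hadoop); it is a simplified Python model of one module, the URL selector, where the map step groups candidate URLs by host and the reduce step keeps a per-host quota of the highest-scored URLs. All function names and the quota of 2 are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user map function, then group emitted values by key (the shuffle)."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reduce_fn):
    """Apply the user reduce function to each (key, values) group."""
    return [reduce_fn(key, values) for key, values in grouped.items()]

def select_map(record):
    # Emit (host, (score, url)) so all URLs of one host meet in one reducer.
    url, score = record
    host = url.split('/')[2]
    yield host, (score, url)

def select_reduce(host, scored):
    # Keep only the top-2 URLs per host (an assumed per-host politeness quota).
    top = sorted(scored, reverse=True)[:2]
    return host, [url for _, url in top]

crawl_db = [("http://a.com/1", 0.9), ("http://a.com/2", 0.4),
            ("http://a.com/3", 0.7), ("http://b.com/x", 0.8)]
selected = reduce_phase(map_phase(crawl_db, select_map), select_reduce)
```

Because map output is partitioned by host, each reducer sees all of one host's URLs, which is what lets a per-host quota be enforced without cross-node coordination.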
2. Improvement of system performance
To reduce the time spent on domain-name resolution and to lighten the load on DNS servers, this paper combines hash-based URL balancing, an optimized hashing strategy, and a multi-level cache to optimize the resolution process. An IP-filter plug-in was also implemented so that the fetching area of the system can be customized, improving system and network utilization.

3. Ensuring the quality of crawled web pages
This paper presents a crawling strategy based on page-rank priority: a URL evaluation scheme and a revised OPIC algorithm built on the observed change behavior of web pages and a study of page types. The strategy not only discovers new pages promptly but also re-fetches frequently changing pages in time, keeping the web-page database fresh. Moreover, under limited crawl capacity it retrieves the most important pages first, which is crawling performance in the true sense. To keep crawled pages up to date, a dynamic-selection prediction method chooses among different algorithms to predict when each page should be refreshed. Experiments based on these strategies show good results.
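The combination of hash-based URL balancing and a multi-level DNS cache described in result 2 can be sketched as follows. This is an illustrative Python model under stated assumptions, not the system's actual code: hosts are hash-partitioned so the same domain is always handled (and its DNS answer cached) by the same node, a per-process LRU cache serves as the first cache level, and a dictionary stands in for an assumed cluster-wide second level. `resolve_uncached` is a placeholder for a real DNS query.

```python
import hashlib
from functools import lru_cache

NUM_NODES = 4          # assumed cluster size

def node_for_host(host):
    """Hash-partition hosts across crawl nodes; a stable assignment means each
    domain's DNS answer is resolved and cached on exactly one node."""
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

shared_cache = {}      # level-2 cache (stand-in for a shared, cluster-wide store)

def resolve_uncached(host):
    # Placeholder for an actual DNS lookup (socket.gethostbyname in practice).
    return "93.184.216.34"

@lru_cache(maxsize=10_000)   # level-1 cache: fast, per-process
def resolve(host):
    if host in shared_cache:
        return shared_cache[host]
    ip = resolve_uncached(host)
    shared_cache[host] = ip
    return ip
```

Repeated lookups for the same host then hit the level-1 cache and never reach the DNS server, which is the stated goal of reducing resolution time and server load.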
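The page-rank-priority idea in result 3 builds on OPIC (On-line Page Importance Computation), whose core bookkeeping can be sketched briefly. This is a minimal Python model of standard OPIC, not the paper's revised variant: each URL holds "cash"; when a page is fetched, its cash moves into its history (accumulated importance) and is split evenly among its outlinks, and the frontier always fetches the URL with the most cash next. The class and method names are illustrative.

```python
class OpicScorer:
    """Minimal OPIC bookkeeping: cash flows from fetched pages to their
    outlinks, and accumulated history approximates page importance."""

    def __init__(self, seeds):
        share = 1.0 / len(seeds)
        self.cash = {url: share for url in seeds}      # spendable importance
        self.history = {url: 0.0 for url in seeds}     # accumulated importance

    def fetch(self, url, outlinks):
        cash = self.cash.pop(url, 0.0)
        self.history[url] = self.history.get(url, 0.0) + cash
        if outlinks:
            share = cash / len(outlinks)
            for link in outlinks:
                self.cash[link] = self.cash.get(link, 0.0) + share

    def next_url(self):
        # Greedy priority: fetch the URL holding the most cash.
        return max(self.cash, key=self.cash.get) if self.cash else None
```

Because newly discovered outlinks immediately receive cash, new pages enter the priority queue at once, which matches the abstract's claim of discovering new pages in time under limited capacity.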
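The dynamic-selection prediction of refresh times can likewise be sketched. This is a hypothetical Python illustration of the general idea only, since the abstract does not name the concrete algorithms: with too little change history the predictor falls back to a fixed default interval, and with enough samples it switches to the mean observed change interval. The threshold of 3 samples and the one-day default are assumptions.

```python
def predict_next_update(change_times, default_interval=86400.0):
    """Pick a predictor based on how much change history a page has.

    change_times: ascending timestamps (seconds) at which the page changed.
    Returns the predicted timestamp of the next change.
    """
    if len(change_times) < 3:
        # Too little history: fall back to a fixed default interval.
        last = change_times[-1] if change_times else 0.0
        return last + default_interval
    # Enough history: predict from the mean observed change interval.
    gaps = [b - a for a, b in zip(change_times, change_times[1:])]
    return change_times[-1] + sum(gaps) / len(gaps)
```

For a page observed to change at t = 0, 100, 200, 300, the mean gap is 100, so the next refresh is scheduled at t = 400; a page with one observation is simply revisited after the default interval.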