
Architecture And Optimization Of Parallel Crawler

Posted on: 2007-12-15
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhao
Full Text: PDF
GTID: 2178360212477073
Subject: Computer software and theory
Abstract/Summary:
A search engine is a shortcut for finding information on the Internet, and the crawler is one of its essential components. The crawler is responsible for gathering web information and is the only source of raw data for the search engine database. This thesis addresses web search, a cutting-edge technology, and investigates the related theory and technology in detail. A high-performance parallel crawler is designed and implemented on that basis. The research work consists of the following parts.

Firstly, this thesis proposes basic architectures for a parallel crawler and identifies fundamental issues in parallel crawling, including the task assignment algorithm and the system's internal communication and collaboration. Based on this analysis, a centralized architecture is proposed for ChaoCrawler. A collaboration mechanism based on NFS, together with its solution for concurrent data processing, is also introduced.

To address the practical problems a parallel crawler faces, this thesis proposes three optimization policies for ChaoCrawler: collision avoidance, URL indexing, and DNS caching. The collision avoidance algorithm combines URL hashing with site-name hashing. It balances the workload and avoids collisions during parallel fetching. URL indexing is implemented by indexing URL checksums. A URL index database supporting two indexing algorithms, Hash and B+ tree, is built on Berkeley DB and meets the needs of the parallel crawler. DNS caching is a client-side cache that adopts a full-cache policy. It resolves the DNS bottleneck and greatly improves the performance of the parallel crawler.

Finally, a parallel crawler named ChaoCrawler is designed and implemented based on these methods. Experiments are conducted to examine its performance, and they validate the effect of the three optimization methods.
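The three optimization policies named in the abstract can be illustrated with a minimal Python sketch. This is not ChaoCrawler's code: the abstract does not say how URL hashing and site-name hashing are combined, so the sketch uses only site-name hashing for worker assignment, and it replaces the Berkeley DB index with an in-memory set. All names (`assign_worker`, `url_checksum`, `UrlIndex`, `DnsCache`) and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import socket
from urllib.parse import urlsplit


def assign_worker(url: str, num_workers: int) -> int:
    """Site-name hashing: all URLs of one host map to the same worker,
    so two workers never fetch from the same site at the same time."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def url_checksum(url: str) -> bytes:
    """Fixed-size checksum used as the index key instead of the full URL."""
    return hashlib.sha256(url.encode("utf-8")).digest()


class UrlIndex:
    """In-memory stand-in for a checksum-keyed URL index.
    (The thesis builds its index on Berkeley DB; this set is an assumption.)"""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url: str) -> bool:
        """Record the URL; return True only the first time it is seen."""
        key = url_checksum(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


class DnsCache:
    """Client-side full cache: every resolved host is kept, never evicted."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver  # injectable so the cache can be tested offline
        self._cache = {}
        self.misses = 0

    def resolve(self, host: str) -> str:
        """Return the cached address, calling the resolver only on a miss."""
        if host not in self._cache:
            self.misses += 1
            self._cache[host] = self._resolver(host)
        return self._cache[host]
```

In this sketch a crawler loop would call `UrlIndex.add_if_new` before queuing a URL, route it with `assign_worker`, and look up the host through a shared `DnsCache`. A real deployment would also need persistence and locking for the index, which the abstract attributes to Berkeley DB and the NFS-based collaboration mechanism.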
Keywords/Search Tags:search engine, information retrieval, crawler, parallel, index