Font Size: a A A

Research On Key Technologies Of Distributed Web Crawling

Posted on:2012-08-05Degree:DoctorType:Dissertation
Country:ChinaCandidate:X XuFull Text:PDF
GTID:1118330362950179Subject:Information security
Abstract/Summary:PDF Full Text Request
In the past 20 years, in order to keep up with the growth of the Web, the Web crawling system has evolved from single-machined system to multi-machined system, and from centralized system to decentralized and distributed system. According to this trend, Web partition, task scheduling and node coordination has become the key issues involved in building an efficient large scale Web crawling system. Although people from both academia and industry have contributed a number of works, as the Web is continuously expanding itself, the requirement always exceeds the capacity of the solutions. Based on the existing works and ideas, this dissertation conducts further researches aiming to find effective and efficient solutions to the critical issues.Although the research concerning Web search engine has lasted many years, no one has given a complete cost model of the entire system, which leads to misunderstandings in choosing the system's design scheme. To fill this gap, we propose a cost model for a typical Web search engine system which consists of a Web crawling system and an indexing system. 5 different design schemes are characterised according to this model and are compared through power consumption, networking cost, system scale and query efficiency. The half-WAN scheme which consists of a WAN-based crawling system and a multi-cluster indexing system is proved to be the best design choice of a large-scale highly-efficient Web search engine.According to the cost model, if the download time of each Web page can be shortened, then the system cost can be lowered. In order to shorten the download time, we propose the concept of NDOWP (Network-distance-oriented Web Partition). Firstly, based on a large number of real Web tests, we propose to use Median-RTT (Median-Round-Trip-Time) as a metric to measure the network distance between each crawler-host pair. Then, based on the network coordinate technology, we put forward the NC-CAN (Network-Coordinate and Content-Addressable-Network) Web Partition algorithm. The algorithm well shortens the average distance each crawler-host pair, thus increasing the total download speed of the distributed crawling system.We further conduct studys on task scheduling in WAN-based Web crawling system aiming to maximum the benefit of Web partition and overcome its shortcomings. We propose a forwarding-based load balancing algorithm called LBI. Experiments show that the system's throughput can be maximized using the algorithm with NC-CAN. We also propose a Web host partition algorithm which can reduce the task granularity thus further improving LBI's effciency.Second, running a distributed Web crawler also causes a huge number of inter-crawler communications, which comsumes the crawler's number of connections, bandwidth and other local resources. We recognize that the inter-link exchange consists of most of these communications; then, we propose a method called Link-Coordinate to model the logical distance between each Web host pair through inter-links; with a DHT-based Web partition scheme we then split the set of Web hosts into subsets each of which has minimium inter-links connected to the others. By assigning the subsets to the crawlers, we can dramatically reduce the inter-crawler communications.To achieve high robustness, high efficiency and low cost, based on our previous study, we propose the design scheme of a WAN-based distributed Web crawling system. The system is actually a overlay network which is composed of various computational resources including personal computers all over Internet. On one hand, the adoption of the P2P concept allows the system to scale itself according to the growth of the Internet and the Web; on the other hand, Web partition schemes and dynamic task scheduling can dramatically reduce the download time of each Web page and increase system throughput and update rate; moreover, the design scheme matches the half-WAN scheme, which is proved to have low system cost.
Keywords/Search Tags:Web, search engine, Web crawling, distributed computing, network distance, network coordinate, Content-Addressable-Network
PDF Full Text Request
Related items