Research On Dynamic Load Balancing Method Of Distributed Crawler System

Posted on: 2015-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Fu
Full Text: PDF
GTID: 2308330479989736
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, search engines have become the primary entry point through which users find information. The web crawler, a core component of a search engine, collects information from the Internet. Because the volume of information online is growing exponentially, collecting web pages both comprehensively and in near real time has become increasingly difficult, which poses a major challenge for crawler systems.

The core problems in crawler system research are how to make full use of computer hardware resources and network bandwidth to collect web pages efficiently, and how to reduce the communication incurred when determining whether a URL has already been seen. With these goals in mind, this thesis studies dynamic load balancing for distributed crawlers as a means of improving crawler performance. The main contributions can be summarized as follows.

After studying the advantages and disadvantages of distributed systems, task scheduling in distributed systems, the working principles of crawlers, and crawling strategies, and taking into account the characteristics of crawlers, the structure of the Internet, and the similarity among the pages of a website, the thesis proposes an algorithm that dynamically predicts site scale based on online feedback. The algorithm first divides site scale into categories, then defines a growth rate for a website and calculates that growth rate during crawling in order to predict the scale of the website step by step. The model underlying the algorithm is trained and validated on collected data.

The site-scale prediction algorithm is then applied in a distributed crawler system to predict the scale of each website while it is being crawled. Using the predicted scale reduces the communication incurred when determining whether a URL has already been seen.
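The abstract does not give the prediction formula, so the following is only a minimal sketch of what online-feedback scale prediction could look like, not the thesis's actual algorithm. The class `SiteScaleEstimator`, the scale thresholds in `SCALE_CLASSES`, the smoothing weight `alpha`, the cap `MAX_RATE`, and the geometric-series extrapolation are all assumptions introduced for illustration.

```python
from dataclasses import dataclass

# Hypothetical scale categories (upper bounds in pages); not from the thesis.
SCALE_CLASSES = [("small", 1_000), ("medium", 100_000)]
# Assumed cap on the growth rate so the extrapolation below stays finite.
MAX_RATE = 0.9


def classify(pages: float) -> str:
    """Map a predicted page count to a coarse scale category."""
    for name, upper in SCALE_CLASSES:
        if pages < upper:
            return name
    return "large"


@dataclass
class SiteScaleEstimator:
    crawled: int = 0          # pages fetched so far
    discovered: int = 1       # distinct URLs known so far (starts with the seed)
    growth_rate: float = 1.0  # smoothed count of new URLs per fetched page
    alpha: float = 0.5        # assumed weight given to the newest batch

    def feedback(self, fetched: int, new_urls: int) -> None:
        """Update the growth rate from one crawl batch (the online feedback)."""
        if fetched <= 0:
            return
        self.crawled += fetched
        self.discovered += new_urls
        batch_rate = new_urls / fetched
        self.growth_rate = (
            self.alpha * batch_rate + (1 - self.alpha) * self.growth_rate
        )

    def predicted_pages(self) -> float:
        """Extrapolate the total site size from the pending URLs.

        Each pending page is assumed to reveal `rate` new pages, each of
        which reveals `rate` more, giving a geometric series.
        """
        pending = max(self.discovered - self.crawled, 0)
        rate = min(self.growth_rate, MAX_RATE)
        return self.discovered + pending * rate / (1 - rate)
```

In this sketch a crawler would call `feedback` after every batch and pass `classify(est.predicted_pages())` to the scheduler, so the estimate is refined step by step as crawling proceeds.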
A load factor is calculated from the scale of each website and the current load of each crawler, and the distributed crawler schedules tasks according to this load factor to achieve dynamic load balancing. The distributed crawler system has good robustness and scalability: it supports adding crawlers and handles failed crawlers quickly.

Finally, the thesis describes the design of the distributed crawler and the implementation of several key modules.
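Load-factor scheduling of this kind can be sketched as follows. This is an illustrative assumption rather than the thesis's design: the load factor is taken to be the total predicted pages assigned to a crawler divided by a relative `capacity`, and the names `Crawler`, `Scheduler`, `assign`, and `remove_crawler` are hypothetical. Whole sites are assigned to a single crawler, so URL duplicate checks for a site can stay local to that crawler.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class Crawler:
    name: str
    capacity: float  # assumed relative throughput of this node
    sites: Dict[str, float] = field(default_factory=dict)  # site -> predicted pages

    def load_factor(self) -> float:
        """Assumed definition: predicted pages assigned per unit of capacity."""
        return sum(self.sites.values()) / self.capacity


class Scheduler:
    def __init__(self, crawlers: Iterable[Crawler]) -> None:
        self.crawlers = {c.name: c for c in crawlers}

    def assign(self, site: str, predicted_pages: float) -> str:
        """Give the site to the crawler with the lowest current load factor.

        Ties are broken by name so the result is deterministic.
        Raises ValueError if no crawlers are registered.
        """
        target = min(
            self.crawlers.values(), key=lambda c: (c.load_factor(), c.name)
        )
        target.sites[site] = predicted_pages
        return target.name

    def add_crawler(self, crawler: Crawler) -> None:
        """Register a new node; it receives work on subsequent assignments."""
        self.crawlers[crawler.name] = crawler

    def remove_crawler(self, name: str) -> None:
        """Handle a failed node by reassigning its sites, largest first."""
        failed = self.crawlers.pop(name)
        for site, pages in sorted(failed.sites.items(), key=lambda kv: -kv[1]):
            self.assign(site, pages)
```

Reassigning the largest sites first is a common greedy heuristic for keeping the remaining crawlers balanced; the abstract does not say how the thesis handles failed crawlers beyond doing so quickly.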
Keywords/Search Tags: distributed crawler, dynamic load balancing, site-scale prediction, reduced communication, task scheduling