Research Of Task Scheduling And AJAX Page Fetching On Distributed Crawler

Posted on: 2016-09-29
Degree: Master
Type: Thesis
Country: China
Candidate: T Li
Full Text: PDF
GTID: 2308330473455995
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, data volumes are growing explosively, and so is the demand for data collection. As an effective means of data acquisition, web crawlers have been widely applied in a variety of systems, such as search engines and public opinion monitoring systems. However, web crawlers for small to medium-sized systems face two difficult problems. First, deploying a crawler on a single machine limits the speed of data acquisition, while existing open-source distributed crawler frameworks are complex and inflexible. Second, although Ajax, which exchanges only the necessary data with the server asynchronously, improves the responsiveness of the user interface and the user experience, traditional crawling fails to obtain the complete content of pages that use Ajax, and the missing data is often of great research value. In a distributed crawler for a small to medium-sized system, the task scheduling algorithm directly influences fetching efficiency. This thesis therefore focuses on the task scheduling strategy of distributed crawlers and on an Ajax page fetching algorithm.

For task scheduling, this thesis mainly studies scheduling under a master-slave architecture. To ensure load balance and scalability, we propose a load-space averaging algorithm based on consistent hashing. Instead of adding virtual nodes by replicating each machine node, the algorithm adds them in a different way, which addresses the load imbalance that arises when the number of machines is small. The center node, which knows the running state of the whole system, uses the algorithm to schedule tasks and adjusts each machine's task allocation whenever the number of machines changes.
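For readers unfamiliar with the baseline that the proposed algorithm departs from, the following is a minimal sketch of conventional consistent hashing, in which virtual nodes are created by replicating each machine node. It is not the thesis's load-space averaging algorithm, whose details the abstract does not give. The class name `HashRing`, the `replicas` parameter, the machine names, and the choice of MD5 on a 32-bit ring are all assumptions made for illustration.

```python
import bisect
import hashlib


def hash_key(key: str) -> int:
    """Map a string to a point on a 32-bit ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % (2 ** 32)


class HashRing:
    """Consistent-hash ring with replica-style virtual nodes (conventional baseline)."""

    def __init__(self, replicas: int = 100):
        self.replicas = replicas   # virtual nodes created per machine
        self._points = []          # sorted ring positions
        self._owner = {}           # ring position -> machine name

    def add_node(self, node: str) -> None:
        for i in range(self.replicas):
            point = hash_key(f"{node}#{i}")
            if point in self._owner:   # skip the rare position collision
                continue
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, url: str) -> str:
        """Return the machine owning the first ring point clockwise from the URL's hash."""
        if not self._points:
            raise LookupError("ring has no nodes")
        idx = bisect.bisect_right(self._points, hash_key(url)) % len(self._points)
        return self._owner[self._points[idx]]


# Hypothetical usage: three crawling nodes, then a fourth joins.
ring = HashRing(replicas=100)
for name in ("crawler-1", "crawler-2", "crawler-3"):
    ring.add_node(name)

urls = [f"http://example.com/page/{i}" for i in range(1000)]
before = {u: ring.get_node(u) for u in urls}
ring.add_node("crawler-4")
after = {u: ring.get_node(u) for u in urls}
moved = [u for u in urls if before[u] != after[u]]
```

In this baseline, every URL in `moved` is reassigned to the newly added machine and all other assignments stay put, which is the scalability property that makes consistent hashing attractive for task scheduling. With few machines and few replicas, however, the arcs owned by each machine can differ widely in size, which is the imbalance the abstract says the proposed algorithm targets.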
Experimental comparison shows that the proposed scheduling algorithm improves the efficiency of load balancing.

For Ajax page fetching, since a single Ajax page contains many states, we first model Ajax pages with a classic state flow graph. We then propose a new method for detecting duplicate states based on changes in the main content of the page, and use it to train XPath features of valid elements. Finally, we apply the training results when fetching Ajax pages. Experimental comparison shows that, while still obtaining all the data, this method further reduces the total number of triggered events and shortens the time consumed, improving the efficiency of Ajax page fetching.

Finally, we present the overall design of a distributed crawler system that supports Ajax page fetching and describe the modules of the center node and the crawling nodes in detail. By successfully applying the system to a network public opinion monitoring project, we verify the effectiveness of the proposed techniques.
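The idea of building a state flow graph while treating two page states as duplicates when their main content is unchanged can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's method: the callbacks `get_events`, `fire`, and `get_main_content`, the toy `PAGES` table, and the use of a SHA-1 fingerprint of whitespace-normalized text are all hypothetical, and a real crawler would back the callbacks with a browser engine.

```python
import hashlib
from collections import deque


def fingerprint(main_content: str) -> str:
    """Hash whitespace-normalized main content; equal hashes count as the same state."""
    normalized = " ".join(main_content.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def crawl_states(initial, get_events, fire, get_main_content):
    """Breadth-first exploration that builds a state flow graph.

    initial          -- handle for the starting page state
    get_events       -- state -> iterable of event ids (here, XPaths of clickable elements)
    fire             -- (state, event) -> resulting state
    get_main_content -- state -> text of the page's main content
    All four are hypothetical callbacks supplied by the caller.
    """
    start = fingerprint(get_main_content(initial))
    states = {start: initial}   # fingerprint -> representative state
    edges = []                  # (from_fingerprint, event, to_fingerprint)
    queue = deque([start])
    while queue:
        src = queue.popleft()
        for event in get_events(states[src]):
            target = fire(states[src], event)
            dst = fingerprint(get_main_content(target))
            edges.append((src, event, dst))
            if dst not in states:   # unseen main content: a genuinely new state
                states[dst] = target
                queue.append(dst)
    return states, edges


# Toy stand-in for a paginated Ajax comment list (hypothetical data).
PAGES = {
    "p1": {"main": "Comments 1-10",
           "events": {"//a[@id='next']": "p2", "//a[@id='ad']": "p1_ad"}},
    "p1_ad": {"main": "Comments  1-10",   # only non-main content changed
              "events": {"//a[@id='next']": "p2", "//a[@id='ad']": "p1_ad"}},
    "p2": {"main": "Comments 11-20",
           "events": {"//a[@id='prev']": "p1"}},
}

states, edges = crawl_states(
    "p1",
    get_events=lambda s: PAGES[s]["events"].keys(),
    fire=lambda s, e: PAGES[s]["events"][e],
    get_main_content=lambda s: PAGES[s]["main"],
)
# XPaths whose events changed the main content; a loose stand-in for "valid elements".
valid_xpaths = {event for src, event, dst in edges if src != dst}
```

In the toy data, clicking the advertisement link leaves the main content unchanged, so `p1_ad` is folded into the state for `p1` and is never explored further. Collecting the XPaths that did change the main content hints at how learned features might let a crawler skip events that are unlikely to reveal new data.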
Keywords/Search Tags: distributed crawler system, task scheduling, consistent hashing, Ajax page, state flow graph