Research Of Task Scheduling And AJAX Page Fetching On Distributed Crawler

Posted on:2016-09-29

Degree:Master

Type:Thesis

Country:China

Candidate:T Li

Full Text:PDF

GTID:2308330473455995

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet, data volumes are growing explosively, and so is the demand for data collection. As an effective means of data acquisition, web crawlers have been widely applied in a variety of systems, such as search engines and public opinion monitoring systems. But web crawlers for small to medium-sized systems face two difficult problems. First, deploying the crawler on a single machine limits the speed of data acquisition, while the existing open-source distributed crawler frameworks are complex and inflexible. Second, although Ajax technology, which exchanges only the necessary data with the server asynchronously, improves the responsiveness of the user interface and the user experience, traditional web crawling fails to obtain the complete information of pages that use Ajax. This missing data is often of great research value. In a distributed web crawler for a small to medium-sized system, the task scheduling algorithm directly influences the fetching efficiency of the system. This thesis therefore focuses on the task scheduling strategy of a distributed crawler and on an Ajax page fetching algorithm.

For the task scheduling strategy, this thesis mainly studies task scheduling under the master-slave architecture. To ensure load balance and scalability, we propose an averaging load space algorithm based on consistent hashing. Rather than adding virtual nodes by copying each machine node, the algorithm adds them in a different way, in order to solve the load imbalance problem that arises when the number of machines is small. The center node uses the algorithm to schedule tasks based on its knowledge of the running state of the whole system, and adjusts the task allocation of each machine after the number of machines changes. Experimental comparison shows that the algorithm improves the efficiency of load balancing.

For Ajax page fetching, since one Ajax page contains many states, we first use a classic state flow graph to model Ajax pages. We then propose a new method of duplicate state detection based on changes in the main content of the page, and use it to train XPath features of valid elements. Finally, we apply the training results to fetch Ajax pages. Experimental comparison shows that this method further reduces the total number of triggered events and shortens the time consumed while still obtaining all data, thereby improving the efficiency of Ajax page fetching.

Finally, we present the overall design of a distributed crawler system that supports Ajax page fetching, and describe the modules of the center node and the crawling node in detail. By successfully applying the distributed crawler system to a network public opinion monitoring project, we verify the effectiveness of the proposed techniques.
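
For readers unfamiliar with consistent hashing, the following is a minimal Python sketch of the conventional technique that the abstract takes as its starting point: each machine is copied onto the ring as several virtual nodes, and a task (here, a URL) goes to the first virtual node clockwise from its hash. This is not the averaging load space algorithm proposed in the thesis, whose details the abstract does not give. The class name `HashRing`, the `replicas` parameter, the `node#i` naming of virtual nodes and the use of MD5 are all assumptions made for illustration.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string to a point on a 2**32 ring (MD5 is used only to spread keys)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % (2 ** 32)


class HashRing:
    """Consistent-hash ring with replica-style virtual nodes (hypothetical helper)."""

    def __init__(self, replicas: int = 100):
        self.replicas = replicas  # virtual nodes per machine (assumed value)
        self._points = []         # sorted ring positions
        self._owner = {}          # ring position -> machine name

    def add_node(self, node: str) -> None:
        # Place `replicas` copies of the machine on the ring.
        for i in range(self.replicas):
            point = ring_hash(f"{node}#{i}")
            if point in self._owner:
                continue  # position already taken; keep the existing owner
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove_node(self, node: str) -> None:
        # Drop every virtual node that belongs to the machine.
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        # A key belongs to the first virtual node clockwise from its hash.
        if not self._points:
            raise LookupError("ring has no nodes")
        idx = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(replicas=100)
for name in ("crawler-1", "crawler-2", "crawler-3"):
    ring.add_node(name)

urls = [f"http://example.com/page/{i}" for i in range(1000)]
before = {u: ring.get_node(u) for u in urls}
ring.add_node("crawler-4")
after = {u: ring.get_node(u) for u in urls}
moved = [u for u in urls if before[u] != after[u]]
```

In this sketch, every URL in `moved` is reassigned to the newly added `crawler-4`, and no URL moves between the three original machines. That is the scalability property that motivates consistent hashing for task scheduling. With few machines and few replicas the shares can still be uneven, which is the imbalance the thesis sets out to address.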
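
The idea of modelling an Ajax page as a state flow graph and detecting duplicate states from the main content can be illustrated with a small, browser-free Python sketch. It is a simplified illustration only, not the thesis's method: states are identified by a hash of the whitespace-normalised main content, every available event is fired (the thesis instead trains XPath features to fire fewer events), and the page itself is simulated by the hypothetical functions `events_of` and `fire`. All names here are assumptions.

```python
import hashlib
from collections import deque


def fingerprint(main_content: str) -> str:
    """Hash whitespace-normalised main content so cosmetic changes do not create new states."""
    normalised = " ".join(main_content.split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def crawl_states(initial, events_of, fire):
    """Breadth-first exploration of an Ajax page's states.

    initial   -- main content of the page as first loaded
    events_of -- callable: content -> event names available in that state
    fire      -- callable: (content, event) -> main content after the event
    Returns (states, edges): states maps fingerprint -> content, and
    edges lists (source fingerprint, event, target fingerprint).
    """
    start = fingerprint(initial)
    states = {start: initial}
    edges = []
    queue = deque([start])
    while queue:
        src = queue.popleft()
        for event in events_of(states[src]):
            content = fire(states[src], event)
            dst = fingerprint(content)
            edges.append((src, event, dst))
            if dst not in states:  # unseen main content: record a new state
                states[dst] = content
                queue.append(dst)
    return states, edges


# Toy page (hypothetical): a three-page news list with a menu toggle.
PAGES = ["news 1", "news 2", "news 3"]


def events_of(content):
    return ["next", "prev", "toggle-menu"]


def fire(content, event):
    i = PAGES.index(content.strip())
    if event == "next":
        return PAGES[min(i + 1, len(PAGES) - 1)]
    if event == "prev":
        return PAGES[max(i - 1, 0)]
    return content + "  "  # menu toggles; main content differs only in whitespace


states, edges = crawl_states(PAGES[0], events_of, fire)
```

On this toy page the crawl records three states and nine edges: the `toggle-menu` event always leads back to the state it was fired from, so it never adds a state. Events of that kind are the ones a crawler would want to learn to skip.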

Keywords/Search Tags:

distributed crawler system, task scheduling, consistent hashing, Ajax page, state flow graph

Related items

1	Storage Access Optimization In Distributed Graph Processing System Based On Consistent Hashing
2	Design And Implementation Of The Dynamic Crawler System Based On State Transition
3	Research And Implement Of Distributed Crawler System Supporting AJAX
4	Research On Algorithm Of Crawling Ajax Dynamic Web Pages Based On User Interface State Changes
5	Research On Task Scheduling In The Parallel Distributed Real-Time Simulation Platform
6	Research On Performance And Power Consumption Optimization Of Distributed Cache System Based On Consistent Hash
7	Design And Implementation Of Customized Distributed Web Crawler
8	Research On CPS Distributed Task Scheduling Algorithm Based On DAG Model
9	Design And Implementation Of An Ajax-supported Deep Web Crawler (Shanghai Jiao Tong University)
10	Design And Implementation Of Distributed Web Crawler System Based On Scrapy