
Design and Implementation of a Wide-Area Distributed Crawler System Based on the Actor Model

Posted on: 2017-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: L P Chen
Full Text: PDF
GTID: 2348330518496512
Subject: Computer technology
Abstract/Summary:
The rapid development of computer and network technology has brought us into a new Internet era, in which users generate large amounts of behavioral data on the Internet every day. Based on the Actor model, this thesis implements a basic framework for collecting large volumes of public data from the Internet. By building a complete distributed web crawler framework, the system provides an integrated service for crawler developers.

Thanks to the activity of the open-source community, a number of relatively mature open-source crawler systems already exist, among which Heritrix and Nutch in Java and Scrapy in Python are the most familiar. Most of these frameworks are complete, but they also have shortcomings and are not always the best choice in certain circumstances. Because of their long development history, their source code has grown very large, which makes problems harder to diagnose when they occur. In addition, their support for distributed operation is relatively weak, or depends on other distributed frameworks; under current conditions of abundant machine resources, this support is rather thin. Against this background, the thesis implements a distributed Internet crawler framework for crawler developers, on top of which a distributed crawler task can be completed efficiently and rapidly.

According to the functional and performance requirements of actual crawler tasks, an overall design scheme is formulated and a complete framework is designed. The system is divided into five modules: the master module, the slave module, the client module, the worker module, and the storage module. The master module is responsible for the overall operation of the framework; the slave module is responsible for establishing processes on the slave nodes; the worker module is responsible for the actual page crawling, parsing, and storing; the client module is responsible for the submission, operation, and management of jobs; and the back-end storage module provides data storage. Each module is independent in usage and at runtime. Communication between modules is mainly carried out over HTTP, while each module is itself composed of sub-modules whose communication is based on the Actor model.

This thesis gives a comprehensive analysis of the system's detailed design and implementation, and then tests the function and performance of the system. Finally, the thesis summarizes the system, gives an outlook, and proposes several feasible improvement schemes.
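The thesis does not name the actor implementation used inside a module, so the following is only a minimal sketch, assuming Akka's classic actor API in Scala, of how a worker sub-module might receive crawl messages from a slave-side dispatcher and report results by message passing. The message types Crawl and PageFetched, and all names in the sketch, are hypothetical and not taken from the thesis.

import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical message types for illustration; the thesis does not define a message protocol.
final case class Crawl(url: String)
final case class PageFetched(url: String, status: Int)

// A worker sub-module sketched as an actor: it handles one URL message at a time,
// would fetch, parse and store the page in the real framework, and replies to
// whatever actor sent the message (e.g. a dispatcher inside the slave module).
class CrawlWorker extends Actor {
  def receive: Receive = {
    case Crawl(url) =>
      // Real fetching, parsing and storage are omitted; this only simulates a successful fetch.
      println(s"worker ${self.path.name} crawling $url")
      sender() ! PageFetched(url, status = 200)
  }
}

object WorkerDemo extends App {
  val system = ActorSystem("crawler-node")            // one actor system per slave process
  val worker = system.actorOf(Props(new CrawlWorker), "worker-1")
  worker ! Crawl("http://example.com")                // asynchronous message passing, no shared state
}

Within a module, messages such as Crawl replace shared-state concurrency, which is the property of the Actor model the framework relies on; across module boundaries the framework instead communicates over plain HTTP.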
Keywords/Search Tags: distributed, crawler, Actor model, framework