
Design and Implementation of a Wide-Area Distributed Crawler System Based on the Actor Model

Posted on: 2017-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: L P Chen
Full Text: PDF
GTID: 2348330518496512
Subject: Computer technology
Abstract/Summary:
The rapid development of computer and network technology has brought us into a new Internet era, in which users generate large amounts of behavioral data on the Internet every day. Based on the Actor model, this thesis implements a basic framework for collecting large volumes of public data from the Internet. By building a complete distributed web crawler framework, the system provides an integrated service for crawler developers.

Thanks to the activity of the open-source community, a number of relatively mature open-source crawler systems already exist, among which Heritrix and Nutch in Java and Scrapy in Python are the most familiar. Most of these frameworks are complete, but they also have shortcomings and are not always the best choice in certain circumstances. Because of their long development history, their source code has grown very large, which makes problems harder to diagnose when they occur. In addition, their support for distributed operation is relatively weak, or depends on other distributed frameworks; under current conditions of abundant machine resources, this support is rather thin. Against this background, the thesis implements a distributed Internet crawler framework for crawler developers, on top of which a distributed crawler task can be completed efficiently and rapidly.

According to the functional and performance requirements of actual crawler tasks, an overall design scheme is formulated and a complete framework is designed. The system is divided into five modules: the master module, the slave module, the client module, the worker module, and the storage module. The master module is responsible for the overall operation of the framework; the slave module is responsible for establishing processes on the slave nodes; the worker module is responsible for the actual page crawling, parsing, and storing; the client module is responsible for the submission, operation, and management of jobs; and the back-end storage module provides data storage. Each module is independent in usage and at runtime. Communication between modules is mainly carried out over HTTP, while each module is itself composed of sub-modules whose communication is based on the Actor model.

This thesis gives a comprehensive analysis of the system's detailed design and implementation, and then tests the function and performance of the system. Finally, the thesis summarizes the system, gives an outlook, and proposes several feasible improvement schemes.
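The thesis does not name the actor implementation used inside a module, so the following is only a minimal sketch, assuming Akka's classic actor API in Scala, of how a worker sub-module might receive crawl messages from a slave-side dispatcher and report results by message passing. The message types Crawl and PageFetched, and all names in the sketch, are hypothetical and not taken from the thesis.

import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical message types for illustration; the thesis does not define a message protocol.
final case class Crawl(url: String)
final case class PageFetched(url: String, status: Int)

// A worker sub-module sketched as an actor: it handles one URL message at a time,
// would fetch, parse and store the page in the real framework, and replies to
// whatever actor sent the message (e.g. a dispatcher inside the slave module).
class CrawlWorker extends Actor {
  def receive: Receive = {
    case Crawl(url) =>
      // Real fetching, parsing and storage are omitted; this only simulates a successful fetch.
      println(s"worker ${self.path.name} crawling $url")
      sender() ! PageFetched(url, status = 200)
  }
}

object WorkerDemo extends App {
  val system = ActorSystem("crawler-node")            // one actor system per slave process
  val worker = system.actorOf(Props(new CrawlWorker), "worker-1")
  worker ! Crawl("http://example.com")                // asynchronous message passing, no shared state
}

Within a module, messages such as Crawl replace shared-state concurrency, which is the property of the Actor model the framework relies on; across module boundaries the framework instead communicates over plain HTTP.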
Keywords/Search Tags: distributed, crawler, Actor model, framework