Design And Implementation Of Cloud Crawler Subsystem Of Cloud Data Collecting System

Posted on: 2020-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: P F Huang
GTID: 2428330572973552
Subject: Computer technology
Abstract/Summary:
With the development of the Internet and data mining, data on the Internet is becoming more and more valuable. However, web crawlers are hard to use and hard to customize for collecting data from the Internet. This thesis applies cloud computing to web crawling, and designs and implements the Cloud Web Crawler Subsystem (CWCS) of the Cloud Data Collecting System, which is based on Software as a Service (SaaS). Tenants define data collection missions in a specified form according to their needs, then execute them using the independent distributed web crawler service provided by CWCS. The thesis mainly investigates two critical problems in CWCS that arise from combining web crawlers with SaaS: node management and URL crawling task allocation. For node management, it provides a solution using etcd, which enables mixed deployment and node interchange by defining a series of procedures for all web crawler nodes in the subsystem. This solution supports updating the settings of running nodes, scaling the running crawler cluster, and timely detection of failed nodes, ensuring the reliability of the crawler cluster service. For URL crawling task allocation, the thesis proposes a solution, OJCH, based on the jump consistent hash algorithm (JCH). OJCH uses JCH to compute the node for each task, preserving the performance of JCH, and overcomes the shortcoming of JCH by rehashing tasks that map to failed nodes, as verified by experiment. In addition, the thesis provides a deduplication solution for URL crawling tasks that supports recurrent tasks. It then presents the design of CWCS and the design and implementation of its modules: the cluster control module, web service module, task queue module, task scheduler module, task processing module, and node management module. CWCS is then tested against the related test cases and passes the tests. Finally, the thesis concludes with a summary.
Keywords/Search Tags: cloud web crawler, consistent hashing, load balancing, cloud computing
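
The abstract does not give the details of OJCH, so the following is only a minimal sketch of the general idea it describes: assign each URL crawling task to a node with jump consistent hash, and rehash when the chosen node has failed. The `jump_hash` function follows the published jump consistent hash algorithm; the names `url_key` and `assign_node`, the salted-retry rehashing scheme, and the `max_attempts` limit are assumptions made for illustration and are not taken from the thesis.

```python
import hashlib

MASK64 = 0xFFFFFFFFFFFFFFFF


def jump_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash: map a 64-bit key to a bucket in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step used by the algorithm.
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def url_key(url: str, attempt: int = 0) -> int:
    """Hypothetical helper: derive a 64-bit key from a URL and a retry counter."""
    digest = hashlib.sha256(f"{attempt}:{url}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign_node(url: str, num_nodes: int, failed=frozenset(), max_attempts: int = 64) -> int:
    """Pick a crawler node for a URL; rehash with a new salt if the node has failed.

    This retry scheme is an assumption, not necessarily the rehashing used by OJCH.
    """
    for attempt in range(max_attempts):
        node = jump_hash(url_key(url, attempt), num_nodes)
        if node not in failed:
            return node
    # Fallback: deterministic choice among the remaining live nodes.
    live = [n for n in range(num_nodes) if n not in failed]
    if not live:
        raise RuntimeError("no live crawler nodes available")
    return live[url_key(url) % len(live)]


if __name__ == "__main__":
    urls = [f"http://example.com/page/{i}" for i in range(5)]
    for u in urls:
        print(u, assign_node(u, 5), assign_node(u, 5, failed={2}))
```

In this sketch, a task whose original node is still alive keeps the same assignment when another node fails, so only tasks on the failed node are moved. Plain jump consistent hash cannot do this on its own, because it only supports adding or removing the highest-numbered bucket.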