Design And Application Of Distributed Crawler System Based On Micro-Service Architecture

Posted on:2021-01-06

Degree:Master

Type:Thesis

Country:China

Candidate:Y J Ge

Full Text:PDF

GTID:2428330614465935

Subject:Computer technology

Abstract/Summary:

With the rapid development of the social economy, Internet applications have become part of people's daily lives, and obtaining network application data in a timely and accurate manner is increasingly important. Web crawlers, a technology that meets users' specific data needs by accurately capturing information from the Internet, face both huge opportunities and changes. Against this background, early web crawler technology can no longer meet these demands. This thesis analyzes the deficiencies of existing crawler systems in their technical implementation and, to meet the requirements of crawling massive volumes of page data, designs a distributed crawler system based on a microservice architecture using the Spring Cloud framework. The system achieves architectural isolation between microservice modules, and the microservices exchange data through message middleware or remote procedure calls, which improves the availability of the distributed crawler system.

The main work of this thesis includes: designing and implementing a globally unique ID generation algorithm for a distributed environment; a double check for duplicate URLs based on a Bloom filter and Redis; a client-side load-balancing strategy for distributed scenarios; a rate-limiting strategy for handling massive numbers of requests in distributed scenarios; multi-threaded crawling based on a thread pool, combined with a dynamic proxy pool to counter anti-crawler measures, which improves the crawling efficiency and success rate; a page-parsing microservice that supports custom page data extraction based on CSS selectors, with a Redis-based mechanism on the consumer side to prevent repeated message consumption; and MongoDB replication and data sharding to handle massive data storage, together with a Redis Sentinel cluster and persistent storage to ensure high availability.

Experimental testing shows that the distributed crawler system based on the microservice architecture can not only handle crawl requests from a large number of users but also meet users' differing data extraction requirements. The system is also more maintainable and extensible than traditional crawler systems, and it meets the system design requirements.
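The abstract does not give details of the URL double check based on a Bloom filter and Redis, so the following is only a minimal sketch of one common way such a check can work. The class names `BloomFilter` and `UrlDeduplicator`, the hashing scheme, and the sizes are all assumptions made for illustration, and an in-memory Python set stands in for the Redis set that a real deployment would use. The idea shown is that the Bloom filter answers "definitely new" cheaply, and only a "possibly seen" answer is confirmed against the exact store, so a Bloom filter false positive never causes a new URL to be skipped.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely never added"; True means "possibly added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


class UrlDeduplicator:
    """Two-stage duplicate check: Bloom filter first, exact store second."""

    def __init__(self, bloom: BloomFilter, exact_store=None) -> None:
        self.bloom = bloom
        # Assumption: a plain set stands in for a Redis set here.
        self.exact_store = exact_store if exact_store is not None else set()

    def mark_if_new(self, url: str) -> bool:
        """Record the URL and return True if it is new, else return False."""
        if self.bloom.might_contain(url) and url in self.exact_store:
            return False  # confirmed duplicate
        self.bloom.add(url)
        self.exact_store.add(url)
        return True


if __name__ == "__main__":
    dedup = UrlDeduplicator(BloomFilter())
    for url in ["https://example.com/a", "https://example.com/a"]:
        print(url, "new" if dedup.mark_if_new(url) else "duplicate")
```

In a distributed deployment the exact store would be shared between crawler instances, and the check-then-add step would need to be made atomic; both concerns are outside the scope of this sketch.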

Keywords/Search Tags:

Micro Service, Distributed Crawler, Message Queue, Database

Related items

1	Design And Implementation Of Customized Distributed Web Crawler
2	Design And Implementation Of A Distributed Dynamic Web Crawler System
3	Distributed SP Side Short Message Gateway Base On Message Queue
4	Theory, Implementation And Application Of Distributed Message Queue
5	Research And Implementation Of Synchronization Of Database System Based On MRB
6	The Design And Implementation Of Message Queue Operation Platform In Ant Financial
7	Design And Implementation Of Top-Scholar Talents Database System Based On Distributed Crawler
8	Design And Research Of Message Transmission System Based On Message Queue
9	Research On Distributed Message Queue Based On RDMA And NVM
10	Design And Achieve Of Billing And Accounting System Message Queue