
Design And Application Of Distributed Crawler System Based On Micro-Service Architecture

Posted on: 2021-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Ge
GTID: 2428330614465935
Subject: Computer technology
Abstract/Summary:
With the rapid development of the social economy, Internet applications have become part of people's daily life, and obtaining network application data in a timely and accurate manner is becoming more and more important. Web crawlers, a technology that meets users' specific data needs by accurately capturing information from the Internet, also face huge opportunities and changes. Against this technological background, early web crawler technology can no longer handle the task. This thesis analyzes the deficiencies of existing crawler systems in their technical implementation and, to meet the requirements of crawling massive amounts of page data, designs a distributed crawler system based on a microservice architecture using the Spring Cloud framework. The system achieves architectural isolation between its microservice modules, and different microservices exchange data through message middleware or remote procedure calls to improve the availability of the distributed crawler system. The main work of this thesis includes: designing and implementing a globally unique ID generation algorithm for a distributed environment; a double check for duplicate URLs based on a Bloom filter and Redis; a client-side load-balancing strategy for distributed scenarios; a rate-limiting strategy for handling massive numbers of requests in distributed scenarios; multi-threaded crawling based on a thread pool, combined with a dynamic proxy pool to counter anti-crawler measures, to improve the crawler's efficiency and success rate; a page-parsing microservice that supports custom page data extraction based on CSS selectors, together with a Redis-based mechanism on the consumer side that prevents repeated message consumption; and database replication and data sharding based on MongoDB to handle massive data storage, with a Redis Sentinel cluster and persistent storage to ensure high availability. Experimental testing shows that the distributed crawler system based on the microservice architecture can both carry the crawler requests of a large number of users and meet users' differing data extraction requirements. Its maintainability and extensibility are also stronger than those of a traditional crawler system, so it meets the system design requirements.
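The abstract does not describe how the URL double check based on a Bloom filter and Redis works in detail. The following is a minimal Python sketch of one plausible reading, not the thesis's implementation: a Bloom filter answers "definitely new" cheaply, and an exact store is consulted only when the filter reports a possible match, which rules out false positives. The class names, the filter parameters, and the use of a plain Python set as a stand-in for a Redis set are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        # Sizes are illustrative assumptions, not values from the thesis.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


class UrlDeduplicator:
    """Two-stage duplicate check: Bloom filter first, exact store second."""

    def __init__(self, bloom=None, seen_store=None):
        self.bloom = bloom if bloom is not None else BloomFilter()
        # Hypothetical stand-in for a Redis set; a real deployment might
        # use set-membership commands on a shared Redis instance instead.
        self.seen_store = seen_store if seen_store is not None else set()

    def is_new(self, url):
        """Return True and record the URL if it has not been seen before."""
        # The exact store is queried only when the filter reports a hit.
        if self.bloom.might_contain(url) and url in self.seen_store:
            return False
        self.bloom.add(url)
        self.seen_store.add(url)
        return True


dedup = UrlDeduplicator()
print(dedup.is_new("https://example.com/a"))  # True
print(dedup.is_new("https://example.com/a"))  # False
```

In this arrangement the second check matters only for Bloom filter false positives, so most never-seen URLs are accepted without touching the exact store.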
Keywords/Search Tags:Micro Service, Distributed Crawler, Message Queue, Database