Design And Implementation Of Distributed And Automatic Crawler Based On Redis

Posted on: 2019-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: S Zeng
Full Text: PDF
GTID: 2428330566995781
Subject: Software engineering
Abstract/Summary:
The 21st century is an era of data explosion. The data generated each day is hard to quantify and highly varied, and much of it is time-sensitive. Faced with this daily volume and variety, enterprises and individuals who need such data want to obtain the most valuable content from informational websites. They expect a unified data format, large data volume, and timeliness, and they also want low cost and high efficiency. Unlike traditional crawlers, the Redis-based distributed automated crawler focuses on fast crawling of news articles and blog posts. Automated parsing saves the time that would otherwise be spent writing a parsing script for each website. Following the general workflow of a crawler, the system is designed as an architecture of functionally independent modules: a scheduling module, a download module, a parsing module, and a storage module. Intermediate data flows between the modules through the distributed queues that the Redis middleware naturally provides. The scheduling module implements a range of crawling strategies, including periodic crawling, error retry, breakpoint (resumable) crawling, real-time crawling, deduplication, crawl-rate control, and concurrency control. The parsing module is divided into two sub-modules, a list-page auto-parsing module and a detail-page auto-parsing module, which together parse a website as a whole. In addition, the Redis-based distributed modular crawler architecture is designed to support multi-machine deployment and large-scale data crawling. Building on the implementation principles, workflow, and crawl strategies of common crawlers, together with web page text extraction and distributed-systems methods and techniques, the work delivers a distributed modular crawler system that automatically parses informational websites and efficiently supports a variety of crawling strategies. Experiments comprising a crawl-rate test of the system itself, a crawl-rate comparison with Scrapy, and an automated-parsing test show that the Redis-based distributed automated crawler is feasible in both architecture and practice.
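
The four-module flow described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The queue names, `MAX_RETRIES`, the `fetch` callback, and the title-only parser are all hypothetical, and `FakeRedis` is an in-memory stand-in that mimics only the three Redis commands the sketch uses (`LPUSH`, `RPOP`, `SADD`) so that the example runs without a server. Only deduplication and error retry are shown; the other scheduling strategies are omitted.

```python
import json
from collections import deque

MAX_RETRIES = 2  # hypothetical retry limit, not taken from the thesis


class FakeRedis:
    """In-memory stand-in for the Redis commands used below (assumption)."""

    def __init__(self):
        self.lists = {}
        self.sets = {}

    def lpush(self, key, value):
        self.lists.setdefault(key, deque()).appendleft(value)

    def rpop(self, key):
        queue = self.lists.get(key)
        return queue.pop() if queue else None

    def sadd(self, key, member):
        members = self.sets.setdefault(key, set())
        if member in members:
            return 0  # already present, nothing added
        members.add(member)
        return 1


def schedule(r, urls):
    """Scheduling module: enqueue only URLs not seen before (deduplication)."""
    for url in urls:
        if r.sadd("seen_urls", url):
            r.lpush("queue:download", json.dumps({"url": url, "attempts": 0}))


def download(r, fetch):
    """Download module: fetch one page; on failure, requeue it (error retry)."""
    raw = r.rpop("queue:download")
    if raw is None:
        return False
    task = json.loads(raw)
    try:
        html = fetch(task["url"])
    except IOError:
        if task["attempts"] < MAX_RETRIES:
            task["attempts"] += 1
            r.lpush("queue:download", json.dumps(task))
        return True
    r.lpush("queue:parse", json.dumps({"url": task["url"], "html": html}))
    return True


def parse(r):
    """Parsing module: extract the <title> text as a placeholder for real parsing."""
    raw = r.rpop("queue:parse")
    if raw is None:
        return False
    page = json.loads(raw)
    html = page["html"]
    title = html.split("<title>")[1].split("</title>")[0] if "<title>" in html else ""
    r.lpush("queue:store", json.dumps({"url": page["url"], "title": title}))
    return True


def store(r, sink):
    """Storage module: move one parsed item into the sink (a list here)."""
    raw = r.rpop("queue:store")
    if raw is None:
        return False
    sink.append(json.loads(raw))
    return True


def run(r, urls, fetch):
    """Drive all four modules in one process until every queue is empty."""
    sink = []
    schedule(r, urls)
    busy = True
    while busy:
        steps = [download(r, fetch), parse(r), store(r, sink)]
        busy = any(steps)
    return sink
```

Because the modules communicate only through named queues, each function could in principle run as a separate worker process on a different machine against a shared Redis instance, which is the property the abstract relies on for multi-machine deployment. In that setting a blocking pop would likely replace the polling loop in `run`.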
Keywords/Search Tags:Distributed, Crawler, Automatic