
Research And Implementation Of Distributed Network Crawler Based On Storm

Posted on: 2019-08-29    Degree: Master    Type: Thesis
Country: China    Candidate: F Geng    Full Text: PDF
GTID: 2428330626450235    Subject: Engineering
Abstract/Summary:
With the advent of the big data era, data plays an increasingly important role, and acquiring large volumes of data quickly and accurately has become essential. Faced with massive data, a stand-alone web crawler is constrained by CPU, I/O, and bandwidth, and its crawling performance is relatively low; it can no longer meet the data collection demands of the big data era. Distributed web crawlers emerged to address this. A distributed web crawler can be understood as a cluster of crawlers: when a crawl task arrives, multiple machines work on it simultaneously, completing it faster and more efficiently.

This project targets Qidian Chinese Network, a currently popular online literature portal, and designs and implements a distributed web crawler based on Storm, deployed on the Docker platform. By combining the characteristics of Storm and Docker, the crawler's crawling performance, portability, and scalability are improved.

Building on the Storm stream processing framework, the Redis cache database, the Docker container platform, and web crawler fundamentals, this thesis designs a Storm-based distributed web crawler with the following characteristics. First, the web crawler is combined with the Storm stream processing framework so that crawling tasks are parallelized, improving the crawler's crawl performance. Second, crawled URLs are stored in the Redis cache database, which provides high-speed URL access and improves URL read and write performance; the crawled data is then stored in a MySQL cluster, and the MyCat database middleware is introduced alongside MySQL to provide read/write splitting and horizontal sharding. Finally, the distributed crawler system is deployed on the Docker container platform, which keeps the crawler consistent and scalable across environments.

The experimental environment uses six virtual machines to build a Docker container cluster: four virtual machines host the Storm container cluster, and the remaining two host the Redis container cluster and the MySQL container cluster. Data capture tests are then run with different numbers of working hosts and different numbers of worker processes in the Storm container cluster. Analysis of the test results shows that the functional modules of the distributed web crawler work well and capture data with high accuracy. Compared with a stand-alone web crawler, its data capture efficiency is significantly improved, and its flexibility and scalability are markedly enhanced.
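To make the URL-handling step concrete, the sketch below shows one way a Redis-backed URL frontier could look in Java: a pending-URL queue plus a "seen" set for de-duplication, so that crawler tasks on different Storm workers can share work without re-fetching pages. This is a minimal illustration only, assuming the Jedis client and a local Redis instance; the class name, key names, and seed URL are hypothetical and not taken from the thesis.

```java
import redis.clients.jedis.Jedis;

/**
 * Minimal Redis-backed URL frontier: a shared queue of pending URLs plus a
 * "seen" set for de-duplication. Key names ("crawler:pending", "crawler:seen")
 * and the localhost address are illustrative assumptions.
 */
public class RedisUrlFrontier {
    private final Jedis jedis;

    public RedisUrlFrontier(String host, int port) {
        this.jedis = new Jedis(host, port);
    }

    /** Enqueue a URL only if it has never been seen before. */
    public void offer(String url) {
        // SADD returns 1 when the member is new, 0 when it was already present.
        if (jedis.sadd("crawler:seen", url) == 1) {
            jedis.lpush("crawler:pending", url);
        }
    }

    /** Pop the next URL to crawl, or return null when the queue is empty. */
    public String poll() {
        return jedis.rpop("crawler:pending");
    }

    public static void main(String[] args) {
        RedisUrlFrontier frontier = new RedisUrlFrontier("localhost", 6379);
        frontier.offer("https://www.qidian.com/");  // example seed URL
        frontier.offer("https://www.qidian.com/");  // duplicate is ignored
        System.out.println(frontier.poll());        // -> https://www.qidian.com/
        System.out.println(frontier.poll());        // -> null (queue empty)
    }
}
```

In a Storm topology of the kind described above, a spout instance would call poll() to emit URLs to fetch bolts, and parse bolts would call offer() for newly discovered links; because Redis operations are atomic, multiple workers can share the same frontier safely.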
Keywords/Search Tags:Web Crawler, Storm, Redis, MySQL, Docker