
Research And Implementation Of Distributed Network Crawler Based On Storm

Posted on: 2019-08-29    Degree: Master    Type: Thesis
Country: China    Candidate: F Geng    Full Text: PDF
GTID: 2428330626450235    Subject: Engineering
Abstract/Summary:
With the advent of the big data era, data plays an increasingly important role, and acquiring large volumes of data quickly and accurately has become essential. Faced with massive data, a stand-alone web crawler is constrained by CPU, I/O, and bandwidth, and its crawling performance is relatively low; it can no longer meet the data collection demands of the big data era. Distributed web crawlers emerged to address this. A distributed web crawler can be understood as a cluster of crawlers: when a crawl task arrives, multiple machines work on it simultaneously, completing it faster and more efficiently.

This project targets Qidian Chinese Network, a currently popular online literature portal, and designs and implements a distributed web crawler based on Storm, deployed on the Docker platform. By combining the characteristics of Storm and Docker, the crawler's crawling performance, portability, and scalability are improved.

Building on the Storm stream processing framework, the Redis cache database, the Docker container platform, and web crawler fundamentals, this thesis designs a Storm-based distributed web crawler with the following characteristics. First, the web crawler is combined with the Storm stream processing framework so that crawling tasks are parallelized, improving the crawler's crawl performance. Second, crawled URLs are stored in the Redis cache database, which provides high-speed URL access and improves URL read and write performance; the crawled data is then stored in a MySQL cluster, and the MyCat database middleware is introduced alongside MySQL to provide read/write splitting and horizontal sharding. Finally, the distributed crawler system is deployed on the Docker container platform, which keeps the crawler consistent and scalable across environments.

The experimental environment uses six virtual machines to build a Docker container cluster: four virtual machines host the Storm container cluster, and the remaining two host the Redis container cluster and the MySQL container cluster. Data capture tests are then run with different numbers of working hosts and different numbers of worker processes in the Storm container cluster. Analysis of the test results shows that the functional modules of the distributed web crawler work well and capture data with high accuracy. Compared with a stand-alone web crawler, its data capture efficiency is significantly improved, and its flexibility and scalability are markedly enhanced.
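To make the URL-handling step concrete, the sketch below shows one way a Redis-backed URL frontier could look in Java: a pending-URL queue plus a "seen" set for de-duplication, so that crawler tasks on different Storm workers can share work without re-fetching pages. This is a minimal illustration only, assuming the Jedis client and a local Redis instance; the class name, key names, and seed URL are hypothetical and not taken from the thesis.

```java
import redis.clients.jedis.Jedis;

/**
 * Minimal Redis-backed URL frontier: a shared queue of pending URLs plus a
 * "seen" set for de-duplication. Key names ("crawler:pending", "crawler:seen")
 * and the localhost address are illustrative assumptions.
 */
public class RedisUrlFrontier {
    private final Jedis jedis;

    public RedisUrlFrontier(String host, int port) {
        this.jedis = new Jedis(host, port);
    }

    /** Enqueue a URL only if it has never been seen before. */
    public void offer(String url) {
        // SADD returns 1 when the member is new, 0 when it was already present.
        if (jedis.sadd("crawler:seen", url) == 1) {
            jedis.lpush("crawler:pending", url);
        }
    }

    /** Pop the next URL to crawl, or return null when the queue is empty. */
    public String poll() {
        return jedis.rpop("crawler:pending");
    }

    public static void main(String[] args) {
        RedisUrlFrontier frontier = new RedisUrlFrontier("localhost", 6379);
        frontier.offer("https://www.qidian.com/");  // example seed URL
        frontier.offer("https://www.qidian.com/");  // duplicate is ignored
        System.out.println(frontier.poll());        // -> https://www.qidian.com/
        System.out.println(frontier.poll());        // -> null (queue empty)
    }
}
```

In a Storm topology of the kind described above, a spout instance would call poll() to emit URLs to fetch bolts, and parse bolts would call offer() for newly discovered links; because Redis operations are atomic, multiple workers can share the same frontier safely.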
Keywords/Search Tags:Web Crawler, Storm, Redis, MySQL, Docker