Design And Implementation Of Distributed Crawler System Based On Docker Cluster

Posted on:2021-01-16

Degree:Master

Type:Thesis

Country:China

Candidate:Q Z Fang

Full Text:PDF

GTID:2428330629451027

Subject:Communication and Information System

Abstract/Summary:


The volume of information on the Internet is growing rapidly. Common search channels, such as Baidu and other search engines, return only disorganized and superficial information that is ranked by factors such as relevancy, without targeted screening by algorithms. Web crawlers are a common means of capturing web information, but an ordinary standalone crawler is inefficient. Later distributed systems built on virtual machines (VMs) improved efficiency, yet they still fall well short of the speed users expect. Traditional crawler systems also handle URL deduplication and page content deduplication poorly, and simple crawlers are often blocked by a website's anti-crawler measures. To obtain useful information quickly, this thesis designs a distributed web crawler system based on a Docker container cluster. Built on the scrapy-redis framework, the system uses Redis to store the URLs extracted during crawling and MongoDB to store the parsed page content, and it is deployed as a master-slave hybrid distributed system. Experiments show that the system can effectively crawl the information users require and that it is considerably faster than the VM-based distributed system. The work and innovations of this thesis are as follows:

(1) The BloomFilter deduplication algorithm is studied in depth and extended. A two-stage BloomFilter deduplication scheme is proposed to lower the false positive (misjudgment) rate.

(2) The general crawler framework is improved to better support distributed systems. A page content search module is added, query time in mass-data scenarios is optimized, and the efficiency and effectiveness of crawling are improved.

(3) The restrictions that large websites commonly place on crawlers are studied in depth, and targeted countermeasures are implemented for the anti-crawler mechanisms frequently encountered during crawling.

(4) The distributed cluster system based on Docker containers is studied in depth, the Kubernetes cluster management platform is explored, and the distributed crawler system is deployed on it.
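The abstract does not say how the two-stage BloomFilter scheme is built. As an illustration only, the sketch below shows one plausible reading: a URL counts as a duplicate only when two independently salted Bloom filters both report it, so a false positive requires both filters to err at once. The class names, the salts, and the default sizes are assumptions made for this example, and the filters live in process memory rather than in Redis.

```python
import hashlib


class BloomFilter:
    """A plain in-memory Bloom filter backed by a bytearray."""

    def __init__(self, num_bits: int, num_hashes: int, salt: str) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.salt = salt  # hypothetical: makes each filter's hashes independent
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            data = f"{self.salt}:{i}:{item}".encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


class TwoStageBloomFilter:
    """Assumed interpretation: two filters must agree before a URL is dropped."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.first = BloomFilter(num_bits, num_hashes, salt="stage-1")
        self.second = BloomFilter(num_bits, num_hashes, salt="stage-2")

    def seen_before(self, url: str) -> bool:
        """Return True if url was probably seen; otherwise record it."""
        duplicate = url in self.first and url in self.second
        if not duplicate:
            self.first.add(url)
            self.second.add(url)
        return duplicate


if __name__ == "__main__":
    dedup = TwoStageBloomFilter()
    for url in ["https://example.com/a", "https://example.com/a"]:
        print(url, "duplicate" if dedup.seen_before(url) else "new")
```

A Bloom filter never misses a URL it has recorded, so this arrangement can only reduce false positives, at the cost of roughly twice the memory and hashing work. In a distributed deployment the bit arrays would need to live in shared storage such as Redis so that every worker consults the same filters.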

Keywords/Search Tags:

Distributed crawler, Scrapy, Docker, BloomFilter


Related items

1	Design And Implementation Of A Distributed Crawler System Based On Scrapy Framework
2	Design And Implementation Of Distributed Web Crawler System Based On Scrapy
3	Design And Development Of Distributed Crawler Based On Scrapy Framework
4	Design And Implementation Of Search System Based On Scrapy-redis And GMM
5	Research And Design Of Distributed Crawler Based On Docker Cluster
6	Design And Implementation Of Distributed Books Web Crawler System
7	Design And Implementation Of Distributed Crawler Project Based On Biomedical Literature Data
8	Design And Implementation Of Web Crawler System Based On Scrapy Framework
9	Research Of Distributed Web Crawler Based On Hadoop
10	Design And Implementation Of Distributed Online Book Crawler System