
Design And Implementation Of Distributed Crawler System Based On Docker Cluster

Posted on: 2021-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: Q Z Fang
Full Text: PDF
GTID: 2428330629451027
Subject: Communication and Information System
Abstract/Summary:
Information on the Internet is growing rapidly. Common search channels, such as Baidu and other search engines, return only disorganized and superficial information, ranked by factors such as relevance without any targeted screening. Web crawlers are a common means of capturing web information, but an ordinary standalone crawler is inefficient. Distributed crawlers built on virtual machines (VMs) improve efficiency, yet they still fall well short of the speed users expect. Traditional crawler systems also handle URL deduplication and page content deduplication poorly, and simple crawlers are often blocked by a website's anti-crawler mechanisms.

To obtain useful information quickly, this thesis designs a distributed web crawler system based on a Docker container cluster. Built on the scrapy-redis framework, the system uses Redis to store the URLs extracted during crawling and MongoDB to store the parsed page content, and it is deployed as a hybrid master-slave distributed system. Experiments show that the system can effectively crawl the information users require and that it is considerably faster than the VM-based distributed system.

The work and innovations of this thesis are as follows:
(1) The BloomFilter deduplication algorithm is studied in depth and extended. A two-stage BloomFilter deduplication scheme is proposed that achieves a lower misjudgment (false positive) rate than the original algorithm.
(2) The general crawler framework is improved to better support distributed systems. A page content search module is added, query time on massive data sets is optimized, and the efficiency and effectiveness of crawling are improved.
(3) The common restrictions that large websites impose on crawlers are studied in depth, and targeted countermeasures are implemented for the anti-crawler mechanisms most frequently encountered during crawling.
(4) Distributed cluster systems based on Docker containers are studied in depth, the Kubernetes cluster management platform is explored, and the distributed crawler system is deployed on it.
Keywords/Search Tags:Distributed crawler, Scrapy, Docker, BloomFilter
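
The abstract does not say how the two stages of the BloomFilter scheme are combined. As an illustration only, the sketch below assumes one plausible reading: a URL is treated as a duplicate only when two independently salted Bloom filters both report it. The class names `BloomFilter` and `TwoStageBloomFilter`, the method `seen_before`, the salts, and the sizing parameters are all hypothetical and are not taken from the thesis. The filters are kept in process memory here, whereas the system described above stores crawl state in Redis.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size in-memory Bloom filter (illustrative, not the thesis code)."""

    def __init__(self, capacity: int, error_rate: float, salt: str = "") -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)
        self.salt = salt  # a different salt gives a different family of hash positions

    def _positions(self, item: str) -> list[int]:
        # Double hashing: derive k positions from two 64-bit slices of one SHA-256 digest.
        digest = hashlib.sha256((self.salt + item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the stride is never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


class TwoStageBloomFilter:
    """Assumed two-stage scheme: both stages must agree before a URL is called a duplicate."""

    def __init__(self, capacity: int, error_rate: float) -> None:
        self.first = BloomFilter(capacity, error_rate, salt="stage1:")
        self.second = BloomFilter(capacity, error_rate, salt="stage2:")

    def seen_before(self, url: str) -> bool:
        """Return True if url was probably seen already; otherwise record it and return False."""
        duplicate = url in self.first and url in self.second
        if not duplicate:
            self.first.add(url)
            self.second.add(url)
        return duplicate


if __name__ == "__main__":
    dedup = TwoStageBloomFilter(capacity=1000, error_rate=0.01)
    for url in ["https://example.com/a", "https://example.com/b", "https://example.com/a"]:
        print(url, "duplicate" if dedup.seen_before(url) else "new")
```

Under this assumption a new URL is wrongly dropped only when both stages produce a false positive for it, so the combined misjudgment rate falls below that of either stage alone, at the cost of twice the memory. A Bloom filter never produces false negatives, so a URL that has been recorded is always reported as a duplicate.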