
Research And Design Of Distributed Crawler Based On Docker Cluster

Posted on: 2018-07-30    Degree: Master    Type: Thesis
Country: China    Candidate: W L Li    Full Text: PDF
GTID: 2348330512479809    Subject: Engineering
Abstract/Summary:
Since the government put forward the implementation of the national big data strategy, the status of Internet big data as an important strategic resource has become increasingly evident, and the Web crawler, as an effective tool for exploiting Internet big data, has grown correspondingly important. However, traditional crawler systems are built on VM clusters, which suffer from insufficient utilization of host resources and difficulty in scaling the crawler system. The development of Docker, an emerging virtualization technology, provides an opportunity to solve the problems of running Web crawlers in a VM environment.

The distributed crawler based on a Docker cluster is researched mainly from two aspects: distributed crawler technology and Docker cluster technology. Existing open-source crawler frameworks differ in their support for distributed crawling; the Scrapy framework, for example, does not support distributed crawling out of the box, and the existing frameworks are better suited to running on VM clusters, where system resources are under-utilized. The Docker cluster is a new virtualized cluster technology that can use host resources more rationally and efficiently than a VM cluster. By studying mature Web crawler architectures, this paper implements a fully distributed crawler system and runs it on a Docker cluster. An improved K-type Bloom filter algorithm, which achieves a better deduplication effect and meets the needs of distributed applications, is also realized to improve the system.

The main work of this paper is as follows.

Firstly, the working principle of the Web crawler and its overall architectural design patterns are studied in depth, together with the cluster scheduling and management tool and its management and scheduling mechanisms; deduplication algorithms are researched and applied to the distributed crawler system.

Secondly, by studying the original Web crawler framework, the reasons it does not support distributed crawling are identified, and distributed crawler modules suitable for a Docker cluster are implemented. These modules are combined into a complete and efficient distributed crawler system, which is run on a Docker cluster by using Kubernetes, the scheduling and management tool of the Docker cluster, to manage its functional modules.

Thirdly, experiments with the designed distributed crawler deployed on a traditional VM cluster and on a Docker cluster respectively show that the system on the Docker cluster achieves better crawling efficiency, more efficient utilization of host resources, and better system expansibility.

Finally, by studying the principle of the classical Bloom filter and its error probability, the improved K-type Bloom filter algorithm is designed to meet the needs of distributed applications and to further improve the deduplication effect. Experiments show that the improved K-type Bloom filter achieves a better deduplication effect.
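The deduplication step above builds on the classical Bloom filter, whose false-positive probability for n inserted items, m bits, and k hash functions is approximately (1 - e^(-kn/m))^k. The sketch below is a minimal Python illustration of that classical filter only; it does not reproduce the thesis's improved K-type variant or its distributed deployment, and the class name, the SHA-256 double-hashing scheme, and the example URL are illustrative assumptions rather than the author's implementation.

```python
import hashlib
import math


class BloomFilter:
    """Classical Bloom filter with k hash functions (illustrative sketch).

    Sizing uses the standard formulas: for n expected items and a target
    false-positive rate p, m = -n*ln(p)/ln(2)^2 bits and k = (m/n)*ln(2).
    """

    def __init__(self, expected_items, false_positive_rate=0.01):
        self.m = max(1, int(-expected_items * math.log(false_positive_rate)
                            / (math.log(2) ** 2)))            # number of bits
        self.k = max(1, round((self.m / expected_items) * math.log(2)))  # hash count
        self.bits = bytearray((self.m + 7) // 8)              # packed bit array

    def _positions(self, item):
        # Derive k bit indices from two digests via double hashing,
        # a common way to simulate k independent hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter(expected_items=1_000_000, false_positive_rate=0.001)
    url = "http://example.com/page/1"      # hypothetical URL for illustration
    if url not in seen:                    # a URL scheduler checks before enqueueing
        seen.add(url)
    print(url in seen)                     # True; false negatives cannot occur
```

In a crawler's URL scheduler, such a filter answers "has this URL been seen?" with no false negatives and a small, tunable false-positive rate, using far less memory than an exact set, which is why the thesis adopts a Bloom-filter-based approach for distributed deduplication.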
Keywords/Search Tags: Docker, Distributed crawler, Kubernetes, Bloom filter