
Research And Design Of Distributed Data Acquisition System Based On Cloud Platform

Posted on: 2021-05-09    Degree: Master    Type: Thesis
Country: China    Candidate: H Q Wang    Full Text: PDF
GTID: 2428330620464203    Subject: Engineering
Abstract/Summary:
With the progress of the times and the continuous development of Internet technology, the Internet has become the main way for people to obtain information, and the volume of Internet data is growing exponentially. How to obtain the content we are interested in efficiently and quickly is therefore worth studying. Many comparable commercial products and open-source tools already exist; drawing on their strengths and addressing their shortcomings, this thesis builds an efficient distributed data acquisition system on the laboratory cloud platform, using its abundant cloud resources to carry out large-scale network data collection.

First, for the actual data collection scenario and in light of the advantages and disadvantages of existing crawler frameworks, the overall architecture of a distributed data collection system is proposed and divided into three parts: the web management terminal, the server, and the collection nodes. Users can flexibly manage collection tasks and collection nodes through the operation interface provided by the web management terminal. To lower the barrier to using the collection system, the server integrates a custom collection template function: users can either define their own collection templates or use the built-in ones. To handle the large number of domain-name resolution requests generated while pages are downloaded, an efficient DNS cache is implemented in the server to optimize the resolution process. To address the shortcomings of the conventional Bloom filter in URL deduplication, a parallel Bloom filter is implemented to reduce the false-positive rate. The collection node is where the page collection work is actually performed; to cope with the anti-crawling measures of the many existing websites, the system integrates anti-crawling modules in the form of download middleware, such as dynamic IP proxy middleware.

Secondly, based on the above design, the system is implemented in Python following the MVC design pattern. Information between the control end and each collection node is transmitted over sockets; the web-page download module uses multi-threading to achieve highly concurrent downloads. The page parsing module builds on the lxml library to support several extraction syntaxes, including XPath, CSS selectors, and regular expressions. The data storage module can save the extracted page data to files (Excel, JSON, etc.) or write it directly to a database, and interfaces are provided for both storage methods.

Finally, the acquisition system was deployed on the laboratory cloud platform and tested in detail against real websites. The experimental results show that the system meets expectations and collects data effectively.
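The abstract does not reproduce the server's DNS cache; as a rough illustration of the idea only, the following minimal sketch caches resolved addresses with a time-to-live so that repeated downloads from the same host skip the system resolver. The `DNSCache` class, the `ttl` value, and the use of `socket.gethostbyname` are illustrative assumptions, not the thesis implementation.

```python
import socket
import time


class DNSCache:
    """Minimal TTL-based DNS cache sketch (hypothetical interface)."""

    def __init__(self, ttl=300):
        self.ttl = ttl
        self._cache = {}  # hostname -> (expires_at, address)

    def resolve(self, hostname):
        now = time.time()
        entry = self._cache.get(hostname)
        if entry and entry[0] > now:
            return entry[1]  # still fresh: serve from the cache
        address = socket.gethostbyname(hostname)
        self._cache[hostname] = (now + self.ttl, address)
        return address


cache = DNSCache(ttl=600)
print(cache.resolve("example.com"))  # hits the resolver
print(cache.resolve("example.com"))  # served from the cache
```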
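The parallel Bloom filter itself is not given in the abstract. The sketch below shows one common partitioned variant, in which each hash function owns its own bit array so lookups can proceed independently, as a hypothetical stand-in; the class name, the partition size, and the salted SHA-256 hashing are assumptions.

```python
import hashlib


class PartitionedBloomFilter:
    """Sketch of a partitioned Bloom filter for URL deduplication."""

    def __init__(self, bits_per_partition=1 << 20, num_hashes=4):
        self.m = bits_per_partition
        self.k = num_hashes
        self.partitions = [bytearray(self.m // 8) for _ in range(self.k)]

    def _indexes(self, url):
        # Derive k independent indexes by salting the URL before hashing.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield i, int.from_bytes(digest[:8], "big") % self.m

    def add(self, url):
        for i, idx in self._indexes(url):
            self.partitions[i][idx // 8] |= 1 << (idx % 8)

    def __contains__(self, url):
        return all(
            self.partitions[i][idx // 8] & (1 << (idx % 8))
            for i, idx in self._indexes(url)
        )


seen = PartitionedBloomFilter()
seen.add("https://example.com/page/1")
print("https://example.com/page/1" in seen)  # True
print("https://example.com/page/2" in seen)  # False (with high probability)
```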
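The abstract names multi-threaded downloading but not the HTTP client or thread model. As a hedged illustration only, a thread pool over a plain `urllib` fetch could look like the following; the URLs, worker count, and `download` helper are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def download(url, timeout=10):
    """Fetch one page and return (url, body size); errors return (url, None)."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, len(resp.read())
    except OSError:
        return url, None


urls = [
    "https://example.com/",
    "https://www.python.org/",
]

# A thread pool gives concurrent downloads without per-URL thread management.
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, size in pool.map(download, urls):
        print(url, size)
```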
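For the parsing module, the abstract only states that lxml underpins XPath, CSS-selector, and regular-expression extraction. A minimal self-contained example of those three syntaxes on a toy page might look as follows; the markup and selectors are invented for illustration.

```python
import re

from lxml import html  # third-party: pip install lxml cssselect

PAGE = """
<html><body>
  <h1 class="title">Distributed crawling</h1>
  <a href="/item/1">Item 1</a>
  <a href="/item/2">Item 2</a>
</body></html>
"""

tree = html.fromstring(PAGE)

# XPath extraction
title = tree.xpath("//h1[@class='title']/text()")[0]

# CSS-selector extraction (needs the cssselect package)
links = [a.get("href") for a in tree.cssselect("a")]

# Regular-expression extraction on the raw markup
item_ids = re.findall(r"/item/(\d+)", PAGE)

print(title, links, item_ids)
```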
Keywords/Search Tags: cloud platform, web crawler, distributed data collection