
Distributed Incremental Acquisition Method For Dynamic Network Data

Posted on: 2018-12-31
Degree: Master
Type: Thesis
Country: China
Candidate: Y Cao
Full Text: PDF
GTID: 2348330518996842
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the explosive growth of the Internet, the Web has become a vast information service spanning sites all over the world, and more and more people devote time and energy to it. E-commerce sites, video sites, forums, and microblogs are all important sources of data. Whether the goal is to obtain the latest data promptly for research or to collect large volumes of information in bulk, an appropriate and efficient method of data extraction is needed. As one of the technologies for retrieving data at scale, web crawling has again attracted attention, and while using it to collect large amounts of online data, practitioners have also worked to improve it.

After studying web crawler technology, distributed technology, Docker, and related Linux technology, this thesis completes the following work.

First, it proposes a design approach for a practical web crawler and describes the design and implementation of the information extraction module and the data storage module.

Second, it builds a distributed crawler cluster on the distributed crawler framework PySpider, replacing a manually distributed crawler, and resolves several problems related to the Linux servers.

Third, to improve the operating efficiency of PySpider, it designs and implements a method for building the distributed crawler cluster with Docker, tests crawler efficiency before and after this improvement, and analyzes the test results.

Finally, to meet the practical requirements of the project, it combines an incremental crawler with PySpider, realizing a distributed incremental acquisition method for dynamic network data.

The last part summarizes the current work and outlines expectations and plans for future work.
Keywords/Search Tags: spider, pyspider, distributed, docker
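
The abstract does not say how the incremental crawler decides which pages to re-acquire. One common approach, shown here only as a minimal sketch and not as the thesis's actual method, is to keep a content fingerprint per URL and store a page only when it is new or its fingerprint has changed. The names `IncrementalIndex` and `incremental_pass` are hypothetical, and the in-memory dictionary stands in for whatever shared store a distributed cluster would presumably use.

```python
import hashlib


class IncrementalIndex:
    """Tracks a content fingerprint per URL so unchanged pages can be skipped.

    Hypothetical helper for illustration; a distributed crawler would
    likely keep this mapping in a store shared by all workers.
    """

    def __init__(self):
        # url -> SHA-256 hex digest of the content seen most recently
        self._fingerprints = {}

    def classify(self, url, content):
        """Return 'new', 'changed', or 'unchanged' and record the fingerprint."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        previous = self._fingerprints.get(url)
        self._fingerprints[url] = digest
        if previous is None:
            return "new"
        return "unchanged" if previous == digest else "changed"


def incremental_pass(index, pages):
    """Return only the (url, content) pairs that need storing on this pass."""
    return [
        (url, content)
        for url, content in pages
        if index.classify(url, content) != "unchanged"
    ]
```

On a first pass every page is classified as new; on later passes only pages whose content differs from the recorded fingerprint are returned, which is what keeps repeated crawls of dynamic data from storing duplicates.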