Distributed Incremental Acquisition Method For Dynamic Network Data

Posted on:2018-12-31

Degree:Master

Type:Thesis

Country:China

Candidate:Y Cao

Full Text:PDF

GTID:2348330518996842

Subject:Electronics and Communications Engineering

Abstract/Summary:


With the explosive growth of the internet, the Web has become a vast information service with sites all over the world, and more and more people devote time and energy to it. E-commerce sites, video sites, forums, and microblogs are all important sources of data. Whether the goal is to obtain the latest data promptly for research or to collect large volumes of information in bulk, an appropriate and efficient method of data extraction is needed. As one of the technologies for retrieving data at scale, web crawling has again attracted attention, and while people use it to collect large amounts of online data, they have also worked to improve it.

After studying web crawler technology, distributed technology, Docker, and related Linux technology, this thesis completes the following tasks. First, it proposes a design approach for a practical web crawler and describes the design and implementation of the information extraction module and the data storage module. Second, it builds a distributed crawler cluster on PySpider, a distributed crawler framework, to replace the manually built distributed crawler, and solves some Linux server problems along the way. Third, to improve the operating efficiency of PySpider, it designs and implements a method of building the distributed crawler cluster with Docker, tests crawler efficiency before and after the improvement, and analyzes the test results. Finally, to meet the practical requirements of the project, it combines incremental crawling with the PySpider framework and thereby implements a distributed incremental acquisition method for dynamic network data. The thesis closes by summarizing the current work and outlining expectations and plans for future work.
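The abstract does not say how the incremental crawler decides which pages to fetch again. As a minimal sketch of the general idea only, and not the thesis's actual method, the following Python keeps a content fingerprint per URL and passes on only pages that are new or have changed since the previous crawl. The names IncrementalStore and incremental_pass are hypothetical, and the in-memory dictionary is an assumption; a distributed cluster would need a store shared by all workers.

```python
import hashlib


class IncrementalStore:
    """Tracks one content fingerprint per URL so repeat crawls can skip unchanged pages."""

    def __init__(self):
        # Assumption: in-memory storage for illustration only.
        # Workers in a distributed cluster would need a shared database instead.
        self._fingerprints = {}

    @staticmethod
    def fingerprint(content: str) -> str:
        # SHA-256 of the page body serves as the change detector.
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def is_new_or_changed(self, url: str, content: str) -> bool:
        """Return True and record the page if the URL is unseen or its content differs."""
        digest = self.fingerprint(content)
        if self._fingerprints.get(url) == digest:
            return False
        self._fingerprints[url] = digest
        return True


def incremental_pass(store, pages):
    """Keep only the (url, content) pairs that are new or changed since the last pass."""
    return [(url, content) for url, content in pages
            if store.is_new_or_changed(url, content)]
```

On a first pass every page is kept. On later passes only pages whose content hash differs from the stored one are kept, which is what makes the acquisition incremental.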

Keywords/Search Tags:

spider, pyspider, distributed, docker


Related items

1	Research And Design Of Distributed Crawler Based On Docker Cluster
2	Design And Implementation Of Distributed Crawler System Based On Docker Cluster
3	The Design And Implementation Of Distributed Data Acquisition System In IPTV Based On Docker
4	Research And Implementation Of Distributed Web Platform Based On Docker
5	Design And Implementation Of Resource Allocation And Container Cluster Management System Based On Docker
6	Design And Research Of Network Spider
7	Research And Application Of Full-Text Search Engine Based On Docker Technology
8	Research On Scheduling Strategy And Communication Optimization Technology Of Distributed Docker Cluster
9	Research And Implementation Of High Availability Platform Based On Docker Swarm Cluster
10	The Design And Implementation Of A Virtualized Application Platform Based On Docker