Research And Application Of Distributed Wechat Public Platform Web Crawler System

Posted on:2016-12-23

Degree:Master

Type:Thesis

Country:China

Candidate:L Wu

Full Text:PDF

GTID:2308330464461218

Subject:Computer technology

Abstract/Summary:

In recent years, the rapid development of the mobile Internet has had a great impact on people's way of life. As a new social platform, WeChat has accumulated more than 600 million registered users. The WeChat Public Platform has also quickly become popular: it has accumulated more than 8 million certified accounts and published more than 200 million articles. Research on this huge information resource is therefore of great significance.

This thesis designs and develops a distributed web crawler system for the WeChat Public Platform, based on the open-source web crawler framework Scrapy. The public account and article information crawled from the platform is stored in a single-node MySQL database and in the FastDFS distributed file system. The thesis then briefly illustrates an application scenario: a data service, built on the crawled data, that lets public users set up a website quickly.

First, the overall architecture, the component modules, and the internal mechanisms of the Scrapy framework are studied thoroughly. Because Scrapy so far supports only a single crawler node, it is extended with a new scheduler module built on a Redis ordered set, so that multiple crawlers can run in master-slave mode.

Second, on top of this improved Scrapy framework, the crawl procedure and strategy are determined by analyzing the structure and characteristics of the web pages on the WeChat Public Platform, and the crawler module is developed. The storage principle is that string data, which is relatively small, is stored in the MySQL database, while file data, which is relatively large, is stored in the FastDFS distributed file system. Incremental crawling is implemented with a queue that records the time of the latest crawl action.

Third, based on this design, the program code is written, tested, and deployed. The running results show that the crawler nodes cooperate well and that the load is balanced across the nodes, so the expected design goals are basically achieved.

Next, an application scenario based on the crawled data is briefly illustrated: a server-side program that provides a data service with which public users can set up a website quickly.

Finally, the work done so far is summarized, and potential improvements to the system are introduced as entry points for further research.
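The scheduler idea in the abstract, a request queue kept in a Redis ordered set and shared by several crawler nodes, can be sketched roughly as follows. This is a minimal illustration and not the thesis's actual module: the class names, the key `crawler:requests`, and the priority convention are all assumptions. So that the sketch runs without a Redis server, a small in-memory stand-in replaces the client; it mimics only the shape of redis-py's `zadd(name, mapping, nx=...)` and `zpopmin(name, count)`. `ZPOPMIN` needs Redis 5.0 or later, so the original system presumably popped requests in some other way.

```python
from typing import Dict, List, Optional, Tuple


class InMemorySortedSet:
    """Hypothetical stand-in for a Redis client (zadd and zpopmin only)."""

    def __init__(self) -> None:
        self._sets: Dict[str, Dict[str, float]] = {}

    def zadd(self, name: str, mapping: Dict[str, float], nx: bool = False) -> int:
        """Add members with scores; return how many members were new."""
        zset = self._sets.setdefault(name, {})
        added = 0
        for member, score in mapping.items():
            if member in zset:
                if not nx:  # with nx=True an existing member is left untouched
                    zset[member] = score
                continue
            zset[member] = score
            added += 1
        return added

    def zpopmin(self, name: str, count: int = 1) -> List[Tuple[str, float]]:
        """Remove and return up to `count` members with the lowest scores."""
        zset = self._sets.get(name, {})
        # Redis orders by score, then lexicographically by member.
        ordered = sorted(zset.items(), key=lambda kv: (kv[1], kv[0]))[:count]
        for member, _ in ordered:
            del zset[member]
        return ordered


class SortedSetScheduler:
    """Shared request queue: every crawler node pushes to and pops from one key."""

    def __init__(self, client, key: str = "crawler:requests") -> None:
        self.client = client  # assumed to offer zadd/zpopmin like redis-py
        self.key = key

    def enqueue(self, url: str, priority: int = 0) -> bool:
        """Queue a URL; return False if it is already waiting in the queue."""
        # The lowest score pops first, so a higher priority gets a lower score.
        return self.client.zadd(self.key, {url: -priority}, nx=True) == 1

    def next_request(self) -> Optional[str]:
        """Pop the highest-priority URL, or None when the queue is empty."""
        popped = self.client.zpopmin(self.key, 1)
        return popped[0][0] if popped else None


if __name__ == "__main__":
    shared = InMemorySortedSet()
    node_a = SortedSetScheduler(shared)  # two "nodes" sharing one queue
    node_b = SortedSetScheduler(shared)
    node_a.enqueue("http://example.com/account/1", priority=1)
    node_b.enqueue("http://example.com/article/7", priority=5)
    print(node_a.next_request())
    print(node_b.next_request())
```

Because every node pops from the same ordered set, whichever node is idle takes the next request, which is one simple way to get the load balancing the abstract reports. Note that `nx=True` only suppresses duplicates while a URL is still queued; a full deduplication filter would need a separate record of URLs already crawled. With a real redis-py client, members come back as bytes unless the client is created with `decode_responses=True`.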

Keywords/Search Tags:

WeChat Public Platform, Distributed Crawler, Scrapy, FastDFS

Related items

1	Design And Implementation Of Distributed Web Crawler System Based On Scrapy
2	Design And Development Of Distributed Crawler Based On Scrapy Framework
3	Design And Implementation Of A Distributed Crawler System Based On Scrapy Framework
4	Design And Implementation Of Distributed Books Web Crawler System
5	Design And Implementation Of Distributed Crawler Project Based On Biomedical Literature Data
6	Design And Implementation Of Web Crawler System Based On Scrapy Framework
7	The Development And Implementation Of Patient Medical Service System Based On WeChat Platform
8	Design And Implementation Of Distributed Crawler System Based On Docker Cluster
9	Design And Implementation Of Distributed Online Book Crawler System
10	Scrapy Framework-based Web Crawler Achieved Data Capture And Analysis