
Research And Application Of Distributed Wechat Public Platform Web Crawler System

Posted on: 2016-12-23
Degree: Master
Type: Thesis
Country: China
Candidate: L Wu
GTID: 2308330464461218
Subject: Computer technology
Abstract/Summary:
In recent years, the rapid development of the mobile Internet has greatly changed the way people live. As a new social platform, Wechat has accumulated more than 600 million registered users. The Wechat Public Platform has also quickly become popular: it has accumulated more than 8 million certified accounts, which have published more than 200 million articles. Research on this huge body of information is of great significance.

This thesis designs and develops a distributed web crawler system for the Wechat Public Platform, based on the open-source web crawler framework Scrapy. The public account and article information crawled from the Wechat Public Platform is stored in a single-node MySQL database and the FastDFS distributed file system. The thesis then briefly illustrates an application scenario: a data service that lets public users quickly set up a website based on the crawled data.

First, the overall architecture, the component modules and the internal mechanisms of the Scrapy framework are studied thoroughly. To address its drawback of supporting only a single crawler node so far, the framework is extended with a new scheduler module based on a Redis sorted set, so that it supports multiple crawlers in master-slave mode.

Second, based on the improved Scrapy framework, the crawl procedure and strategy are determined by analyzing the structure and characteristics of the web pages on the Wechat Public Platform, and the crawler module is then developed. The storage principle is that string data, which is relatively small, is stored in the MySQL database, while file data, which is relatively large, is stored in the FastDFS distributed file system. Incremental crawling is implemented by designing a queue that records the time of the latest crawl action.

Third, based on the design above, the program code is written, tested and deployed.
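The idea of a scheduler backed by a Redis sorted set, shared by several crawler nodes, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual code: the class names `InMemorySortedSet` and `SharedScheduler`, the key `crawler:requests` and the priority convention are all hypothetical. The in-memory class stands in for a Redis client so the sketch runs without a server; it imitates only the `zadd` (with `nx`), `zpopmin` and `zcard` operations of a sorted set.

```python
class InMemorySortedSet:
    """Hypothetical stand-in for a Redis client; mimics zadd/zpopmin/zcard only."""

    def __init__(self):
        self._sets = {}  # key -> {member: score}

    def zadd(self, key, mapping, nx=False):
        zset = self._sets.setdefault(key, {})
        added = 0
        for member, score in mapping.items():
            if member in zset:
                if nx:
                    continue  # nx=True: never update an existing member
            else:
                added += 1
            zset[member] = score
        return added  # number of newly added members

    def zpopmin(self, key, count=1):
        zset = self._sets.get(key, {})
        # Lowest score first; ties are broken by member, as in a sorted set.
        lowest = sorted(zset.items(), key=lambda kv: (kv[1], kv[0]))[:count]
        for member, _ in lowest:
            del zset[member]
        return lowest

    def zcard(self, key):
        return len(self._sets.get(key, {}))


class SharedScheduler:
    """Request queue that every crawler node reads from and writes to."""

    def __init__(self, client, key="crawler:requests"):  # key name is an assumption
        self.client = client
        self.key = key

    def enqueue(self, url, priority=0):
        # The lowest score pops first, so a higher priority maps to a lower score.
        # nx=True means a URL that is already queued is not added again.
        return self.client.zadd(self.key, {url: -priority}, nx=True) == 1

    def next_request(self):
        popped = self.client.zpopmin(self.key, 1)
        return popped[0][0] if popped else None

    def __len__(self):
        return self.client.zcard(self.key)
```

Because the queue lives in one shared store rather than inside a single crawler process, any node can push newly discovered URLs and any idle node can pop the next one, which is one plausible way for load to spread across nodes. With a real Redis client, members may come back as bytes unless response decoding is enabled.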
The running results show that the multiple crawler nodes cooperate well with each other and that the load is balanced across the nodes. The system basically achieves the expected design goals.

Next, an application scenario based on the crawled data is briefly illustrated: a server-side program is designed to provide a data service that lets public users quickly set up a website.

Finally, the work done so far is summarized, and potential improvements to the system are introduced as entry points for further research.
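The incremental crawling mentioned in the abstract relies on recording the time of the latest crawl action. The abstract does not describe the data structure in detail, so the following is only a minimal sketch under assumptions: the class `CrawlTimeRecord`, the `published_at` field and the per-account dictionary are hypothetical, chosen to show how a stored timestamp lets a later crawl skip articles that were already collected.

```python
from datetime import datetime


class CrawlTimeRecord:
    """Remembers, per public account, the newest article time already stored."""

    def __init__(self):
        self._latest = {}  # account_id -> datetime of the newest stored article

    def filter_new(self, account_id, articles):
        """Return only the articles published after the recorded time."""
        cutoff = self._latest.get(account_id)
        fresh = [
            article for article in articles
            if cutoff is None or article["published_at"] > cutoff
        ]
        if fresh:
            # Advance the record so the next crawl skips these articles.
            self._latest[account_id] = max(a["published_at"] for a in fresh)
        return fresh


if __name__ == "__main__":
    record = CrawlTimeRecord()
    first_crawl = [
        {"title": "a", "published_at": datetime(2015, 1, 1)},
        {"title": "b", "published_at": datetime(2015, 1, 2)},
    ]
    second_crawl = first_crawl + [
        {"title": "c", "published_at": datetime(2015, 1, 3)},
    ]
    print(len(record.filter_new("account-1", first_crawl)))
    print([a["title"] for a in record.filter_new("account-1", second_crawl)])
```

In a multi-node deployment the record would need to live in shared storage rather than in process memory; the in-memory dictionary here only keeps the example self-contained.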
Keywords/Search Tags: Wechat Public Platform, Distributed Crawler, Scrapy, FastDFS