Design And Implementation Of Distributed Netnews Crawling System Based On Scrapy

Posted on: 2016-04-21
Degree: Master
Type: Thesis
Country: China
Candidate: L S Ma
GTID: 2348330488974530
Subject: Computer application technology
Abstract/Summary:
The rapid development of the Internet has quietly changed the basic patterns of daily life. The Internet has become the cheapest and most efficient channel for disseminating information and exchanging material, and news reports are among the most important sources of information in daily life. As network technology has spread and matured, news media have evolved into a new form that merges traditional media with Internet media, and public access to news and information keeps growing. The delay with which network news is published has shrunk, so more social groups now get their news from the Internet. At the same time, big-data research on network news is gaining popularity, and the demand for network news data in scientific research is rising accordingly. In response, this thesis designs and implements a distributed web crawler system that extracts network news data and provides sufficient support for related research.
The thesis introduces the origin, development and operating principles of web crawlers, the structure and workflow of the Scrapy framework, the composition and function of each Scrapy-Redis component, and the concepts behind Graphite. It analyzes the main characteristics of a crawler for network news and designs the crawling strategy and the extraction fields according to the characteristics of the target webpages. The system adopts Scrapy as its basic framework and deploys a custom downloader middleware to avoid being blocked by websites. To improve crawling efficiency, it uses a Redis database to deploy a distributed crawler with a master-slave structure, and it uses Graphite as a monitoring tool to visualize the system state. It uses Selenium to extract data from dynamically rendered webpages.
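The abstract does not say what the custom downloader middleware does or how Scrapy-Redis is configured. As a minimal sketch, assuming the middleware rotates the User-Agent header (one common anti-blocking measure, not necessarily the author's), the setup could look like the following. The class name, the User-Agent strings, the `newscrawler` project path and the Redis host are hypothetical; the `SCHEDULER`, `DUPEFILTER_CLASS`, `SCHEDULER_PERSIST` and `REDIS_URL` settings are the ones Scrapy-Redis reads.

```python
import random

# Hypothetical pool of User-Agent strings; the thesis does not list the ones it used.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/49.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0",
]


class RotateUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents=USER_AGENTS, rng=random):
        self.user_agents = list(user_agents)
        self.rng = rng

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # Returning None tells Scrapy to keep processing the request normally.
        return None


# settings.py fragment (project path and Redis host are assumptions).
DOWNLOADER_MIDDLEWARES = {
    "newscrawler.middlewares.RotateUserAgentMiddleware": 543,
}
# Scrapy-Redis: share the request queue and the duplicate filter through Redis,
# so every worker node pulls from the queue held on the master.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the queue across restarts
REDIS_URL = "redis://master-host:6379"
```

With these settings, each worker runs the same spider and the Redis instance on the master node coordinates which URLs are still pending, which matches the master-slave layout the abstract describes.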
We also designed a data processing module whose main functions include data cleaning, transcoding, adding objects and data classification. To test the performance of the system, we chose four major Tencent news columns as targets: domestic news, international news, social news and military news. After running for 10 hours, the system had crawled more than 30,000 news articles and millions of comments. Finally, through three basic data-analysis experiments covering news content, network media and user comments, we analyzed six features of network news: the focus of public opinion, the temporal features of news, user browsing preferences, network media influence, and the gender and regional characteristics of users. The experiments verified the objectivity, accuracy and multi-faceted character of the data.
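The cleaning and transcoding steps of the data processing module are not detailed in the abstract. The sketch below shows one plausible shape for them, assuming pages arrive as raw bytes in either UTF-8 or a GB-family encoding and that cleaning means stripping markup and collapsing whitespace. The function names and the encoding list are illustrative assumptions, not the author's implementation.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # any HTML tag
SPACE_RE = re.compile(r"\s+")     # runs of whitespace, including non-breaking spaces


def to_text(raw, encodings=("utf-8", "gb18030")):
    """Transcode raw bytes to str, trying each candidate encoding in order."""
    if isinstance(raw, str):
        return raw
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode what we can and mark undecodable bytes.
    return raw.decode("utf-8", errors="replace")


def clean_text(raw):
    """Strip tags, unescape HTML entities and normalize whitespace."""
    text = to_text(raw)
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return SPACE_RE.sub(" ", text).strip()
```

For example, `clean_text("<p>Hello &amp;  world</p>\n")` returns `"Hello & world"`. Trying `gb18030` after UTF-8 is a design choice: it is a superset of GBK and GB2312, so one fallback covers the legacy encodings common on Chinese news sites.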
Keywords/Search Tags: network news, distributed crawler, data processing, data analysis