The Design And Implementation Of Focused Crawler On Food Security Public Sentiment News

Posted on: 2016-08-09
Degree: Master
Type: Thesis
Country: China
Candidate: D Zhu
Full Text: PDF
GTID: 2308330479982176
Subject: Software engineering
Abstract/Summary:
Food security incidents now occur frequently and have left consumers deeply unsettled. To curb food security problems and act on both their source and their spread, a project under the 12th Five-Year Plan proposed a new way to trace the origin of food and to monitor public sentiment on food security. Based on this project, our public sentiment monitoring system for food security provides hot topic tracking, hot word recognition, identification of positive and negative opinions, early warning of emergencies, and other functions, which has improved the regulatory capacity of the supervising departments.

As the information source module of the public sentiment monitoring system, the focused crawler discussed in this paper collects food security news from the Internet quickly and in a timely manner. After studying and summarizing the methods used in public sentiment monitoring systems and the algorithms applied in focused crawlers at home and abroad, and taking into account the concrete requirements of the whole system, this paper designs and implements a fully functional focused crawler. The main work and innovations are as follows.

A complete crawler processing chain is designed, consisting of three main stages: information crawling, information extraction, and topic filtering. A web platform for interacting with users is also implemented.

In the information crawling stage, we build on the open-source crawler framework Heritrix and, considering the properties of news source sites, create a self-adaptive crawl controller that adjusts how often each individual site is visited. We use the embedded database Berkeley DB together with MD5 hashing to implement incremental crawling. We then optimize the web page processing chain, for which a site range filter and a page format filter are designed.

In the information extraction stage, we first shorten the web page processing chain to improve speed. Imitating the HTML filters of HTMLParser, we design a new way of combining a tag filter with an attribute filter to extract information precisely. In addition, we unify the encoding of the byte streams and convert Traditional Chinese to Simplified Chinese characters.

In the topic filtering stage, we design whole-site collection for dedicated food security sites and topic-filtered collection for general sites. We also discuss a new topic filtering method that evaluates the titles of food security news with a two-level judgment technique, which achieves a high topic precision rate.

In addition, because many web news articles cite a news source, which indicates the website the article refers to, we propose an additional function that intelligently recommends new sites to monitor, expanding the set of monitored sites and information sources.

To date, the focused crawler described in this paper has been continuously monitoring Tencent, Sina, Netease, Foodmate, the Center for Food Safety in Hong Kong, and 17 other large and medium-sized websites. The total number of web news articles collected has reached 500,000, of which about 9,500 are food security news, and high topic recall and precision rates have been achieved.
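
The incremental crawling idea mentioned in the abstract (an embedded key-value store plus MD5) can be illustrated with a minimal sketch. The abstract does not say what exactly is hashed or how records are keyed, so this sketch assumes that the MD5 digest of the page content is stored under the page URL. The class name `SeenStore` and the method `is_new_or_changed` are hypothetical, and a plain in-memory dictionary stands in for Berkeley DB.

```python
import hashlib


class SeenStore:
    """Hypothetical stand-in for a persistent key-value store such as Berkeley DB."""

    def __init__(self):
        # Maps URL -> MD5 hex digest of the content seen on the last visit.
        self._digests = {}

    def is_new_or_changed(self, url: str, content: bytes) -> bool:
        """Return True if the page is unseen or its content differs from the last visit."""
        digest = hashlib.md5(content).hexdigest()
        if self._digests.get(url) == digest:
            # Same content as last time: an incremental crawl can skip this page.
            return False
        # New page or changed content: remember the digest and process the page.
        self._digests[url] = digest
        return True
```

In such a scheme, a crawler would call `is_new_or_changed` after fetching each page and pass only pages that return `True` on to extraction and topic filtering. MD5 serves here only as a cheap content fingerprint, not as a security measure.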
Keywords/Search Tags:Food Security, Public News Sentiment, Focused Crawler, Heritrix, Incremental Crawler