
Design And Implementation Of Directional Crawler System Based On Naïve Bayes Algorithm

Posted on: 2017-04-12 | Degree: Master | Type: Thesis
Country: China | Candidate: H K Zhang | Full Text: PDF
GTID: 2358330482991357 | Subject: Engineering
Abstract/Summary:
With the rapid development and spread of computer technology and computer networks, the Internet has become an important channel through which all kinds of media disseminate information and express their views. According to the 37th "China Internet Development Statistics Report", published in 2016, the number of Internet users in China had reached 688 million by December 2015, and the Internet penetration rate had reached 50.3%, an increase of 2.4 percentage points over 2014. This shows that the Internet is becoming increasingly important in people's daily lives. When an Internet event that concerns many people is widely disseminated, it forms Internet public opinion. Faced with massive amounts of Internet information, how to collect the data, discover public opinion, analyze public opinion hotspots, and issue early warnings has become a pressing problem. As a result, the directional crawler, an important means of discovering and acquiring data, has attracted widespread attention.

In this thesis, the author studied several mainstream information collection systems and read a large number of academic papers on data collection. Current mainstream crawler systems fall into three categories: the traditional crawler, the topic crawler, and the directional crawler. The traditional crawler, the earliest kind of crawler system, is the basis of the other two technologies. The topic crawler builds on the traditional crawler by adding specific algorithms that optimize the crawling scope and strategy. Although this improves the accuracy of data collection to a certain extent, the intelligence of the algorithms themselves is limited, so the results of a topic crawler may not be usable in fields that require high-precision data. The directional crawler collects data only from specific pages and matches the target information against templates integrated into the system. In theory, the directional crawler is more precise than the topic crawler, but it is not easy to implement because complex regular expressions are difficult to build.

Based on this analysis of mainstream crawler systems, this thesis constructs a directional crawler system based on a Naïve Bayes classifier, with the following improvements:
(1) Regular expressions and XPath are both used to collect data. Regular expressions match a wider range of content than XPath but return more noisy data, whereas XPath places strict requirements on the HTML structure. The system therefore combines the two methods so that each compensates for the other's weaknesses.
(2) A Naïve Bayes classifier is added to the data collection process to reduce noise and improve the accuracy of the system.

The system is developed in C# using the VS2013 development environment and a SQL Server 2008 database. It can collect data from websites including Tencent News, Sina News, Sohu News, Wangxinwang, and others. Compared with a traditional crawler system, this system is more accurate and returns less useless information and spam to users, which makes it more effective.

The system consists of four modules: the system login module, the data acquisition module, the data processing module, and the data storage module. This thesis describes the four modules in detail. The core modules are data acquisition and data processing: the data acquisition module first collects data from the target sites, and the data processing module then keeps only the data that pass the filter. After the data processing module has screened the data, the data storage module stores them in the database.
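The two improvements described above can be illustrated with a short sketch. This is not the thesis's implementation (the thesis system is written in C#); it is a minimal Python illustration under stated assumptions. The page fragment, the training sentences, the "news"/"spam" labels, and the names extract_titles and NaiveBayes are all hypothetical. The sketch uses the limited XPath subset in Python's xml.etree.ElementTree, which requires well-formed markup, and a small hand-written multinomial Naïve Bayes classifier with add-one smoothing.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, well-formed page fragment (real HTML would need a tolerant parser).
PAGE = """<html><body>
<div class="news"><h2>City council approves new subway line</h2></div>
<div class="news"><h2>Win a free phone, click now</h2></div>
<p>Breaking: <b>Flood warning issued for river districts</b></p>
</body></html>"""

# Hypothetical labeled samples used to train the filter.
TRAINING = [
    ("council votes on transport budget", "news"),
    ("flood warning issued for coastal towns", "news"),
    ("new subway line opens downtown", "news"),
    ("win a free prize click here", "spam"),
    ("click now for a free phone offer", "spam"),
    ("cheap loans click to win", "spam"),
]


def extract_titles(page):
    """Collect candidates with a strict XPath query, then a looser regex."""
    root = ET.fromstring(page)
    # XPath: precise, but only matches the expected page structure.
    titles = [h.text.strip()
              for h in root.findall(".//div[@class='news']/h2") if h.text]
    # Regex: wider coverage, so it may also pick up noise.
    for match in re.finditer(r"<b>(.*?)</b>", page, flags=re.S):
        text = match.group(1).strip()
        if text and text not in titles:
            titles.append(text)
    return titles


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training samples
        self.vocab = set()

    def train(self, samples):
        for text, label in samples:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for word in tokenize(text):
                counts[word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log P(label) + sum of log P(word | label)
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = NaiveBayes()
classifier.train(TRAINING)
# Keep only the candidates that the classifier labels as news.
kept = [t for t in extract_titles(PAGE) if classifier.predict(t) == "news"]
print(kept)
```

In this sketch the XPath query supplies the structured matches, the regular expression recovers a title that lies outside the expected structure, and the classifier then discards the candidate that resembles the spam samples. A production system would need a far larger training set and an HTML-tolerant parser.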
Keywords/Search Tags: Data Acquisition, XPath, Regex, Naïve Bayes classifier