
The Design And Implementation Of Network Endanger Source Filtering And Detect Tracking System

Posted on: 2014-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: C Jiang
Full Text: PDF
GTID: 2248330398950531
Subject: Software engineering
Abstract/Summary:
Internet information today is characterized by massive data volumes, a huge number of users, and real-time updates. Filtering and tracking this information to detect harmful ("endanger") content is therefore increasingly important. The main idea of this thesis is to collect Internet information automatically, classify it intelligently with a support vector machine (SVM), a machine learning method, and then find the source of the information so that the endanger source can be tracked.

This thesis uses the open-source Heritrix crawler to collect Internet information automatically. Because Internet information is updated rapidly, detecting harmful sources requires constantly fetching new information. To meet this demand, parts of the Heritrix code were modified to optimize and speed up crawling and to support incremental crawling of a site.

To process the information intelligently, the thesis uses a support vector machine, which is based on statistical learning theory. Classification techniques fall into two major categories: semantics-based and machine-learning-based. Much of the literature shows that support vector machines achieve good classification performance. The information is first preprocessed: text segmentation, feature extraction, and construction of a document feature vector. After preprocessing, a classifier is trained on a small subset of the documents with the open-source libsvm library and then used to predict the class of all documents. Finally, the post time is used to find the source among the classification results.

In the experiment, the Heritrix crawler framework is used to crawl 111 news and information websites, including People's Daily, Sina, Sohu and other news portals, and the collected data are processed. The experiment successfully traces back to the network transmission source. Because networks develop rapidly, tracing network hazard sources should be closely combined with new technologies and new algorithms. The results show that the system can effectively filter, detect and track hazard sources in network information dissemination.
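
The abstract says the source is found by post time among the classification results, without giving further detail. A minimal Python sketch of one way that step could work is shown here. The ClassifiedPage record, the trace_source function and the example URLs are hypothetical names introduced only for illustration, and the sketch assumes every flagged page carries the same piece of information, so the earliest post is treated as the source.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class ClassifiedPage:
    """Hypothetical record for one crawled page after classification."""
    url: str             # where the crawler found the text
    post_time: datetime  # publication time extracted from the page
    harmful: bool        # label predicted by the classifier


def trace_source(pages: Iterable[ClassifiedPage]) -> Optional[ClassifiedPage]:
    """Return the earliest-posted page labelled harmful, or None if there is none."""
    flagged = [p for p in pages if p.harmful]
    return min(flagged, key=lambda p: p.post_time, default=None)


# Made-up example data: two harmful pages and one harmless page.
pages = [
    ClassifiedPage("http://portal-b.example/repost", datetime(2013, 5, 2, 9, 30), True),
    ClassifiedPage("http://portal-a.example/original", datetime(2013, 5, 1, 18, 0), True),
    ClassifiedPage("http://portal-c.example/weather", datetime(2013, 4, 30, 8, 0), False),
]

source = trace_source(pages)
print(source.url if source else "no harmful page found")
```

In a fuller system the flagged pages would probably first be grouped by topic or near-duplicate content, and the earliest post taken within each group; that grouping is not described in the abstract and is left out of the sketch.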
Keywords/Search Tags: Information crawling, Heritrix, Text classification, Support vector machine, Endanger source tracking