
Internet Sensitive Information Identification Based On Semi-Supervised Learning

Posted on: 2013-06-07  Degree: Master  Type: Thesis
Country: China  Candidate: H Wang  Full Text: PDF
GTID: 2268330392970612  Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet, people depend more and more on the network to access and publish information. The Internet can store and transmit large amounts of information, which has a great effect, but it also hides serious security threats. Criminals exploit the free, interactive nature of the Internet to spread speech that damages social harmony; such content is called sensitive information. Once such speech spreads, it tends to cause extremely bad influence and brings enormous pressure from public opinion as well as economic losses. Therefore, it is necessary to identify objectionable Internet information accurately and in a timely manner.

Sensitive information propagates very quickly. Traditional machine learning methods face a serious problem: there is not enough time to label a large number of samples. We can only use a small number of labeled samples to train the classifier, with the help of a large number of unlabeled samples.

Sensitive information makes up only a small part of online public opinion. In the collected samples, general public opinion information dominates. If a classifier is trained on these data, its results are inevitably biased toward the majority class. To solve this problem, over-sampling can increase the number of minority-class samples, which can lead to better classifier performance.

This thesis uses text classification to identify sensitive information on the Internet. Sensitive information is characterized by fast spread, bad influence, and low volume. Several methods are adopted to address these problems. A method is proposed that combines semi-supervised machine learning with over-sampling and improves the traditional SMOTE algorithm. Experimental results show that the improved algorithm can effectively improve classifier performance.
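The abstract does not describe the improved SMOTE variant, so it cannot be reproduced here. As a rough illustration of the over-sampling idea only, the sketch below implements the classic SMOTE interpolation step: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, its parameters, and the toy feature vectors are assumptions made for this example and are not taken from the thesis.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by classic SMOTE interpolation.

    Hypothetical helper for illustration; this is the standard algorithm,
    not the improved variant proposed in the thesis.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Interpolate at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D feature vectors standing in for vectorised "sensitive" documents.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

In a text classification setting the input vectors would typically be document features such as TF-IDF weights, and the synthetic samples would be added to the minority class before training. How this combines with the semi-supervised step in the thesis is not stated in the abstract.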
Keywords/Search Tags: Sensitive Information Identification, Semi-Supervised Learning, Imbalanced Data, SMOTE