Application Research Of Network Information Filtering Model Based On The Content

Posted on: 2010-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: J Du
Full Text: PDF
GTID: 2178360278957709
Subject: Computer software and theory
Abstract/Summary:
Increasingly abundant Internet resources bring convenience to people and promote social development. On the other hand, the disorder of these information resources makes it harder for users to find useful information, and the openness of the Internet provides a channel for spreading illegal information. Techniques that help people select useful information effectively and filter out irrelevant and illegal information are therefore attracting more and more attention. Content-based information filtering provides an effective way to solve this problem.

Starting from a complete Internet information filtering model, this thesis describes the basic issues of information filtering, including its elementary principles, general workflow, related models, and performance evaluation. It then studies content-based text filtering models and related techniques, including automatic Chinese word segmentation, text feature extraction, user template representation, and text classification.

Among the various text classification algorithms, the thesis focuses on the Support Vector Machine (SVM). It analyzes the classification of imbalanced data sets, a common problem in pattern recognition, and demonstrates that the traditional SVM algorithm is biased when predicting on imbalanced data sets. It also proposes a remedy: a sample generation method based on clustering and genetic crossover constructs virtual samples, which reduces the imbalance of the sample space and raises classification accuracy. In addition, according to pattern recognition theory, samples located at the class boundary may cause the SVM to overfit and harm its generalization ability. The SVM is therefore improved by borrowing the idea of k-nearest neighbors (KNN) to prune boundary samples appropriately. This reduces the computational load of training and shortens both training and classification time. Experiments show that the improved algorithm is effective.

For the specific application of filtering harmful text on the Internet, the thesis constructs a text filtering model based on the improved SVM and analyzes common problems in existing harmful information filtering systems. First, because samples containing harmful information are difficult to obtain, the amount of training data is insufficient. Second, the thesis analyzes the principles for constructing a dictionary of harmful words. Following the theory of SVM composite kernel functions, it presents a new composite kernel function that models both standalone harmful words and word combinations in the text. Experimental results show that this kernel function can be applied to harmful information filtering and provides better performance. Finally, a basic prototype of a harmful information filtering system is built on this filtering model.
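
The sample generation idea mentioned in the abstract (clustering the scarce class, then using genetic crossover to build virtual samples) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's actual algorithm: the tiny k-means routine, the choice of blend (arithmetic) crossover, and the names `kmeans` and `crossover_oversample` are all hypothetical and introduced only for this example.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means (illustrative only): returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def crossover_oversample(minority, n_new, k=2, seed=0):
    """Create n_new virtual minority samples by crossing two parents from one cluster.

    Assumption: blend crossover, child = a * parent1 + (1 - a) * parent2 with a
    random a in [0, 1); the thesis may use a different crossover operator.
    """
    rng = random.Random(seed)
    labels = kmeans(minority, k, seed=seed)
    clusters = {}
    for p, lab in zip(minority, labels):
        clusters.setdefault(lab, []).append(p)
    # Only clusters with at least two members can supply a pair of parents.
    usable = [members for members in clusters.values() if len(members) >= 2]
    if not usable:
        raise ValueError("need at least one cluster with two or more samples")
    virtual = []
    while len(virtual) < n_new:
        p1, p2 = rng.sample(rng.choice(usable), 2)
        a = rng.random()
        virtual.append([a * x + (1 - a) * y for x, y in zip(p1, p2)])
    return virtual
```

Calling `crossover_oversample(minority_vectors, n_new)` on the feature vectors of the scarce class returns `n_new` virtual vectors that could be appended to the training set before fitting a classifier. Restricting both parents to the same cluster is meant to keep each virtual sample close to real minority data rather than in the gaps between clusters.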
Keywords/Search Tags: information filtering model, text filtering, support vector machine, kernel function, imbalanced data sets, sample generat