Chinese Spam Filtering Based On SVM

Posted on: 2010-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: H X Yu
Full Text: PDF
GTID: 2178330332459942
Subject: Signal and Information Processing
Abstract/Summary:
Email use has grown steadily with the rapid development of the Internet, and email has made people's lives easier because it is fast, convenient, and inexpensive. The spread of spam, however, has become a serious nuisance, so filtering it is an urgent problem. With advances in text classification, content-based filtering has become an effective way to deal with spam, and SVM has been widely used in text classification with very good results. This thesis focuses on content-based spam filtering: it studies feature selection in depth and applies SVM to spam filtering. The main work is as follows:

(1) The accuracy of Chinese word segmentation directly affects the final performance of a spam filtering system. Because email is transmitted over the network, many neologisms appear in it, and how well they are recognized affects the final classification result. To address this, the spam filtering system uses both string-matching and statistical word segmentation. The string-matching method recognizes words that exist in the machine dictionary, and it is fast and accurate. The statistical method recognizes many neologisms, such as words newly popular on the Internet.

(2) Feature selection is central to spam filtering. A sound feature selection method not only reduces the number of features and speeds up computation, but also removes redundant features and improves the accuracy of the spam filtering system. This thesis studies feature selection and proposes an improved CHI feature selection method and a union feature selection method for the mail preprocessing stage of the spam filtering system. Experimental results show that filtering accuracy with the improved feature selection methods is greatly improved compared with traditional feature selection methods, which demonstrates that the improved methods are correct and effective.

(3) Because SVM has particular advantages for small-sample, high-dimensional, and nonlinear pattern recognition problems, it is adopted as the classifier in the spam filtering system. In addition, to speed up SVM training, LIBSVM is used to train the SVM.
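The first two steps above can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say which string-matching algorithm is used, so forward maximum matching is assumed here as one common choice, and the scoring function is the standard chi-square (CHI) statistic rather than the improved variant the thesis proposes. The function names, the toy dictionary, and the four sample messages are all hypothetical.

```python
def fmm_segment(text, dictionary, max_len=4):
    """Forward maximum matching (assumed method): at each position,
    take the longest substring that appears in the dictionary."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def chi_square(docs, labels, term, target="spam"):
    """Standard chi-square score of a term for the target class,
    computed from the 2x2 term/class contingency table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        if label == target:
            if has_term:
                a += 1  # target class, term present
            else:
                c += 1  # target class, term absent
        else:
            if has_term:
                b += 1  # other class, term present
            else:
                d += 1  # other class, term absent
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Hypothetical toy dictionary and messages, for illustration only.
dictionary = {"免费", "发票", "优惠", "会议", "通知", "纪要"}
messages = ["免费发票", "免费优惠", "会议通知", "会议纪要"]
labels = ["spam", "spam", "ham", "ham"]

docs = [fmm_segment(m, dictionary) for m in messages]
vocabulary = sorted({w for tokens in docs for w in tokens})
scores = {w: chi_square(docs, labels, w) for w in vocabulary}
# Keep the highest-scoring terms as features.
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]
```

In this toy data, the terms that occur only in one class ("免费" and "会议") receive the highest scores, which is the behaviour a CHI-based selector relies on when it discards low-scoring terms before training the classifier.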
Keywords/Search Tags: spam filtering, SVM, Chinese word segmentation, feature selection