
Chinese Spam Filtering Based On Support Vector Machine And Sparse Technology

Posted on: 2014-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: X L Zhang
GTID: 2268330422952278
Subject: Computer application technology
Abstract/Summary:
With the growing popularity of the Internet, electronic mail has become widely used in daily life. However, the large volume of spam causes considerable trouble and inconvenience in email communication. Finding an effective technique for filtering spam has become an urgent problem in the field of Internet information security, and spam filtering is also important as a subject of theoretical research. Current research on spam filtering focuses on filtering based on analysis of email content.

Content-based Chinese spam filtering consists of several parts: Chinese text segmentation, Chinese document representation, feature selection, and classification. Email data forms a high-dimensional, sparse matrix, so dimensionality reduction is crucial for improving the efficiency of a spam filter. Several feature selection techniques exist, such as information gain, mutual information, and the CHI (chi-square) method. We apply the Lasso, a method from regularization theory, to email feature selection. The Lasso (Least Absolute Shrinkage and Selection Operator) is a least squares method with an l1-norm penalty that shrinks coefficients, driving some of them to zero. We therefore use Lasso regression to select the important features.

Support vector machines (SVMs) have been widely used for text classification and spam filtering. Research on SVMs, especially on kernel functions, is a hot topic in the machine learning field. Commonly used SVM kernel functions include the linear kernel, the polynomial kernel, the radial basis function (Gaussian) kernel, and the perceptron kernel. The Q-Gaussian function is a generalization of the Gaussian function with a parameter Q, and it has many properties that the Gaussian function lacks. Building on an in-depth study of the Gaussian function, we introduce the Q-Gaussian function as an SVM kernel and apply the resulting SVM to spam filtering. A series of experiments is conducted on two widely used Chinese email corpora, trec06c and CDSCE. The experimental results show that the Q-Gaussian kernel is more robust than the linear kernel and the Gaussian kernel, and that filtering with the Q-Gaussian kernel achieves very good performance.

In practice, the costs of misclassifying spam and of misclassifying ham (legitimate email) are very different, and spam accounts for the larger share of email traffic. The cost-sensitive support vector machine is robust when dealing with such imbalanced email data and has therefore become a research hotspot. To account for the different misclassification costs, the cost-sensitive SVM uses different error costs when building the filter. In this paper, we propose a new spam filtering method that uses a cost-sensitive SVM (CSSVM) built on libsvm, developed by Lin. We use the new filtering algorithm to improve the accuracy and generalization ability of mail filtering.

In this paper, we improve the feature selection method and the SVM classifier used in spam filtering. We propose feature selection based on the Lasso and an SVM with a Q-Gaussian kernel, and we apply these two methods together with the cost-sensitive SVM to spam classification. Experiments conducted on spam corpora show the effectiveness of these new methods.
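
The abstract does not state the exact form of the Q-Gaussian kernel, so the following is only a minimal sketch under an assumption: that the kernel replaces the exponential in the Gaussian kernel with the Tsallis q-exponential. The function names `q_exp` and `q_gaussian_kernel` and the bandwidth parameter `sigma` are hypothetical and chosen for illustration; the thesis may use a different parameterization.

```python
import math


def q_exp(x, q):
    """Tsallis q-exponential (assumed form); equals exp(x) when q == 1."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    if base <= 0.0:
        # Cutoff convention: the function is set to zero outside its support.
        return 0.0
    return base ** (1.0 / (1.0 - q))


def q_gaussian_kernel(u, v, q=1.5, sigma=1.0):
    """Hypothetical Q-Gaussian kernel: q_exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return q_exp(-sq_dist / (2.0 * sigma ** 2), q)


if __name__ == "__main__":
    u, v = (0.0, 0.0), (1.0, 1.0)
    # q = 1 recovers the ordinary Gaussian kernel.
    print(q_gaussian_kernel(u, v, q=1.0))
    # q > 1 decays more slowly with distance than the Gaussian kernel.
    print(q_gaussian_kernel(u, v, q=1.5))
```

Under this assumed form, Q = 1 gives back the Gaussian kernel, values of Q above 1 give heavier tails, and values below 1 make the kernel vanish beyond a finite distance. A function like this could be evaluated pairwise over the training documents to form a precomputed kernel matrix for an SVM library.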
Keywords/Search Tags: Lasso method, Q-Gaussian kernel function, Cost-sensitive, Support vector machines