Spam Filtering Based on Kernel Partial Least Squares Feature Extraction

Posted on: 2013-10-26
Degree: Master
Type: Thesis
Country: China
Candidate: J Chen
Full Text: PDF
GTID: 2248330362469984
Subject: Computer application technology
Abstract/Summary:
Email is one of the most widely used services on the Internet. As the Internet has developed, spam has proliferated and become a serious nuisance to society. How to block spam effectively has therefore become a research topic of wide public concern in information security and information processing, with both theoretical significance and practical value.

Content-based spam filtering is currently one of the key research directions in this area. It is a supervised learning task and a form of classification. Many machine learning methods have been applied to spam filtering with good results, but data represented in the vector space model are high-dimensional and sparse, and their terms are correlated (for example, synonyms), all of which make classification harder. It is therefore necessary to reduce the dimensionality of spam data. Feature extraction is an important family of dimensionality reduction methods that includes principal component analysis (PCA) and partial least squares (PLS). PCA and PLS were proposed for linear problems, but many problems are nonlinear, so kernel methods were introduced, giving kernel PCA (KPCA) and kernel PLS (KPLS). These methods have been widely used in text mining and genetic data analysis with great success.

PLS maximizes the covariance between the original features and the class labels, uncovering the inherent, hidden features within the original features and producing a new low-dimensional feature space. Kernel partial least squares introduces a kernel function into partial least squares; it works well for reducing the dimensionality of spam data and offsets the adverse effects of correlated variables.

Building on existing research into spam filtering technologies, this thesis focuses on applying PLS and KPLS feature extraction to spam filtering. A comparative experiment with different classification algorithms (support vector machine, SVM, and k-nearest neighbor) is conducted to show the performance of PCA and KPCA on feature extraction. The email corpora used in the experiment are TREC06C and Enron-Spam. Analysis of the comparative experiment leads to the conclusion that the efficiency of spam filtering is improved.
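
The pipeline described in the abstract (kernel PLS feature extraction followed by a classifier) can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: the function names (`rbf_kernel`, `center_kernels`, `kpls_fit`, `knn_predict`), the RBF kernel, the value of `gamma`, the number of components, and the synthetic Poisson "term count" data are all assumptions made for illustration. The toy data stand in for vector space model features and are not drawn from TREC06C or Enron-Spam.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def center_kernels(K_train, K_test):
    """Center both kernel matrices in feature space using training statistics."""
    n = K_train.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    M = np.ones((K_test.shape[0], n)) / n
    return J @ K_train @ J, (K_test - M @ K_train) @ J

def kpls_fit(Kc, y, n_components):
    """Kernel PLS for a single response, with NIPALS-style deflation.

    Returns the training scores T and a matrix R such that the scores of
    any centered kernel matrix K are K @ R.
    """
    n = Kc.shape[0]
    Kd = Kc.copy()
    r = y.astype(float) - y.mean()            # centered response residual
    T = np.zeros((n, n_components))
    U = np.zeros((n, n_components))
    for a in range(n_components):
        u = r / np.linalg.norm(r)             # one response: u is the scaled residual
        t = Kd @ u
        t /= np.linalg.norm(t)                # unit-norm score vector
        P = np.eye(n) - np.outer(t, t)
        Kd = P @ Kd @ P                       # deflate the kernel matrix
        r = r - t * (t @ r)                   # deflate the response
        T[:, a], U[:, a] = t, u
    R = U @ np.linalg.inv(T.T @ Kc @ U)
    return T, R

def knn_predict(Z_train, y_train, Z_test, k=5):
    """Majority-vote k-nearest-neighbor classifier for 0/1 labels (k odd)."""
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]
    return (y_train[idx].mean(axis=1) > 0.5).astype(int)

# Synthetic stand-in for term-count vectors: 200 "emails", 300 "terms".
# Label 1 ("spam") raises the rate of the first 10 terms. Purely illustrative.
rng = np.random.default_rng(0)
n, d = 200, 300
y = np.arange(n) % 2
rates = np.full((n, d), 0.2)
rates[y == 1, :10] = 1.0
X = rng.poisson(rates).astype(float)

X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]
gamma = 1.0 / d                               # assumed kernel width
K_train = rbf_kernel(X_tr, X_tr, gamma)
K_test = rbf_kernel(X_te, X_tr, gamma)
Kc, Kc_test = center_kernels(K_train, K_test)

T, R = kpls_fit(Kc, y_tr, n_components=3)     # 300 features -> 3 extracted features
Z_train, Z_test = Kc @ R, Kc_test @ R
pred = knn_predict(Z_train, y_tr, Z_test, k=5)
print("test accuracy:", (pred == y_te).mean())
```

In this sketch the 300-dimensional count vectors are reduced to three KPLS score features before classification. The k-nearest-neighbor step could be swapped for an SVM, the other classifier named in the abstract, without changing the feature extraction part.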
Keywords/Search Tags: spam, high-dimension, kernel partial least squares, non-linear